Application of nonparametric and semiparametric regression techniques to
high-dimensional time series data has been hampered due to the lack of
effective tools to address the "curse of dimensionality." Under rather weak
conditions, we propose spline-backfitted kernel estimators of the component
functions for the nonlinear additive time series data that are both
computationally expedient so they are usable for analyzing very
high-dimensional time series, and theoretically reliable so inference can be
made on the component functions with confidence. Simulation experiments have
provided strong evidence that corroborates the asymptotic theory.



1. Introduction. For the past three decades, various nonparametric and 
semiparametric regression techniques have been developed for the analysis 
of nonlinear time series; see, for example, [14, 21, 25], to name one article 
representative of each decade. Application to high-dimensional time series 
data, however, has been hampered due to the scarcity of smoothing tools 
that are not only computationally expedient but also theoretically reliable, 
which has motivated the proposed procedures of this paper. 

In high-dimensional time series smoothing, one unavoidable issue is the
"curse of dimensionality," which refers to the poor convergence rate of
nonparametric estimation of general multivariate functions. One solution is
regression in the form of an additive model introduced by [9]:

(1.1)  $Y_i = m(X_{i1}, \ldots, X_{id}) + \sigma(X_{i1}, \ldots, X_{id}) \varepsilon_i, \qquad m(x_1, \ldots, x_d) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha),$

in which the sequence $\{Y_i, \mathbf{X}_i^T\}_{i=1}^n = \{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$
is a length-$n$ realization of a $(d+1)$-dimensional time series, the
$d$-variate functions $m$ and $\sigma$ are the mean and standard deviation of
the response $Y_i$ conditional on the predictor vector
$\mathbf{X}_i = (X_{i1}, \ldots, X_{id})^T$, and each $\varepsilon_i$ is a white
noise conditional on $\mathbf{X}_i$. In a nonlinear additive autoregression
data-analytical context, each predictor $X_{i\alpha}$, $1 \le \alpha \le d$,
could be observed lagged values of $Y_i$, such as $X_{i\alpha} = Y_{i-\alpha}$,
or of an exogenous time series. Model (1.1), therefore, is the exact same
nonlinear additive autoregression model of [14] and [2] with exogenous
variables. For identifiability, additive component functions must satisfy the
conditions $E m_\alpha(X_{i\alpha}) = 0$, $\alpha = 1, \ldots, d$.

We propose estimators of the unknown component functions
$\{m_\alpha(\cdot)\}_{\alpha=1}^d$ based on a geometrically $\alpha$-mixing
sample $\{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$ following model (1.1). If the
data were actually i.i.d. observations instead of a time series realization,
many methods would be available for estimating
$\{m_\alpha(\cdot)\}_{\alpha=1}^d$. For instance, there are four types of
kernel-based estimators: the classic backfitting estimators (CBE) of [9] and
[19]; marginal integration estimators (MIE) of [6, 16, 17, 22, 30] and a
kernel-based method of estimating rate to optimality of [10]; the smoothing
backfitting estimators (SBE) of [18]; and the two-stage estimators, such as
one step backfitting of the integration estimators of [15], one step
backfitting of the projection estimators of [11] and one Newton step from the
nonlinear LSE estimators of [12]. For the spline estimators, see [13, 23, 24]
and [28].

In the time series context, however, there are fewer theoretically justified 
methods due to the additional difficulty posed by dependence in the data. 
Some of these are the kernel estimators via marginal integration of [25, 29], 
and the spline estimators of [14]. In addition, [27] has extended the marginal 
integration kernel estimator to additive coefficient models for weakly depen- 
dent data. All of these existing methods are unsatisfactory in regard to either 
the computational or the theoretical issue. The existing kernel methods are 
too computationally intensive for high dimension d, thus limiting their ap- 
plicability to a small number of predictors. Spline methods, on the other 
hand, provide only convergence rates but no asymptotic distributions, so no 
measures of confidence can be assigned to the estimators. 

If the last $d - 1$ component functions were known by "oracle," one could
create $\{Y_{i1}, X_{i1}\}_{i=1}^n$ with
$Y_{i1} = Y_i - c - \sum_{\alpha=2}^d m_\alpha(X_{i\alpha}) = m_1(X_{i1}) + \sigma(X_{i1}, \ldots, X_{id}) \varepsilon_i$,
from which one could compute an "oracle smoother" to estimate the only
unknown function $m_1(x_1)$, thus effectively bypassing the "curse of
dimensionality." The idea of [15] was to obtain an approximation to the
unobservable variables $Y_{i1}$ by substituting $m_\alpha(X_{i\alpha})$,
$i = 1, \ldots, n$, $\alpha = 2, \ldots, d$, with marginal integration kernel
estimates and arguing that the error incurred by this "cheating" is of
smaller magnitude than the rate $O(n^{-2/5})$ for estimating the function
$m_1(x_1)$ from the unobservable data. We modify the procedure of [15] by
substituting $m_\alpha(X_{i\alpha})$, $i = 1, \ldots, n$,
$\alpha = 2, \ldots, d$, with spline estimators. Specifically, we propose a
two-stage estimation procedure: first we pre-estimate
$\{m_\alpha(x_\alpha)\}_{\alpha=2}^d$ by its pilot estimator through an
undersmoothed centered standard spline procedure; next we construct the
pseudo response $\hat Y_{i1}$ and approximate $m_1(x_1)$ by its
Nadaraya-Watson estimator in (2.12).

The above proposed spline-backfitted kernel (SPBK) estimation method 
has several advantages compared to most of the existing methods. First, as 
pointed out in [22], the estimator of [15] mixed up different projections, mak- 
ing it uninterpretable if the real data generating process deviates from addi- 
tivity, while the projections in both steps of our estimator are with respect 
to the same measure. Second, since our pilot spline estimator is thousands 
of times faster than the pilot kernel estimator in [15], our proposed method 
is computationally expedient; see Table 2. Third, the SPBK estimator can 
be shown to be as efficient as the "oracle smoother" uniformly over any 
compact range, whereas [15] proved such "oracle efficiency" only at a single 
point. Moreover, the regularity conditions in our paper are natural and ap- 
pealing and close to being minimal. In contrast, higher-order smoothness is 
needed with growing dimensionality of the regressors in [17]. Stronger and 
more obscure conditions are assumed for the two-stage estimation proposed 
by [12]. 

The SPBK estimator achieves its seemingly surprising success by borrow- 
ing the strengths of both spline and kernel: the spline does a quick initial 
estimation of all additive components and removes them all except the one 
of interest; kernel smoothing is then applied to the cleaned univariate data 
to estimate with asymptotic distribution. Propositions 4.1 and 5.1 are the 
keys in understanding the proposed estimators' uniform oracle efficiency. 
They accomplish the well-known "reducing bias by undersmoothing" in the 
first step using spline and "averaging out the variance" in the second step 
with kernel, both steps taking advantage of the joint asymptotics of kernel 
and spline functions, which is the new feature of our proofs. 

Reference [7] provides generalized likelihood ratio (GLR) tests for additive 
models using the backfitting estimator. A similar GLR test based on our 
SPBK estimator is feasible for future research. 

The rest of the paper is organized as follows. In Section 2 we introduce the 
SPBK estimator and state its asymptotic "oracle efficiency" under appropri- 
ate assumptions. In Section 3 we provide some insights into the ideas behind 
our proofs of the main results, by decomposing the estimator's "cheating" 
error into a bias and a variance part. In Section 4 we show the uniform or- 
der of the bias term. In Section 5 we show the uniform order of the variance 
term. In Section 6 we present Monte Carlo results to demonstrate that the 
SPBK estimator does indeed possess the claimed asymptotic properties. All 
technical proofs are contained in the Appendix. 




L. WANG AND L. YANG 



2. The SPBK estimator. In this section we describe the spline-backfitted
kernel estimation procedure. For convenience, we denote vectors as
$\mathbf{x} = (x_1, \ldots, x_d)$ and take $\|\cdot\|$ as the usual Euclidean
norm on $R^d$, that is, $\|\mathbf{x}\| = \sqrt{\sum_{\alpha=1}^d x_\alpha^2}$,
and $\|\cdot\|_\infty$ the sup norm, that is,
$\|\mathbf{x}\|_\infty = \sup_{1 \le \alpha \le d} |x_\alpha|$. In what
follows, let $Y_i$ and $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})^T$ be the $i$th
response and predictor vector. Denote by $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$
the response vector and $(\mathbf{X}_1, \ldots, \mathbf{X}_n)^T$ the design
matrix.

Let $\{Y_i, \mathbf{X}_i^T\}_{i=1}^n = \{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$
be observations from a geometrically $\alpha$-mixing process following model
(1.1). We assume that the predictor $X_\alpha$ is distributed on a compact
interval $[a_\alpha, b_\alpha]$, $\alpha = 1, \ldots, d$, and without loss of
generality, we take all intervals $[a_\alpha, b_\alpha] = [0, 1]$,
$\alpha = 1, \ldots, d$. We preselect an integer
$N = N_n \sim n^{2/5} \log n$; see assumption (A6) below. Next, we define for
any $\alpha = 1, \ldots, d$ the first-order B spline function ([3], page 89),
or say the constant B spline function, as the indicator function
$I_{J,\alpha}(x_\alpha)$ of the $N + 1$ equally spaced subintervals of the
finite interval $[0, 1]$ with length $H = H_n = (N + 1)^{-1}$, that is,

(2.1)  $I_{J,\alpha}(x_\alpha) = \begin{cases} 1, & JH \le x_\alpha < (J+1)H, \\ 0, & \text{otherwise}, \end{cases} \qquad J = 0, 1, \ldots, N.$

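Define the following centered spline basis:

(2.2)  $b_{J,\alpha}(x_\alpha) = I_{J+1,\alpha}(x_\alpha) - \frac{\|I_{J+1,\alpha}\|_2}{\|I_{J,\alpha}\|_2} I_{J,\alpha}(x_\alpha), \qquad \forall \alpha = 1, \ldots, d,\ J = 1, \ldots, N,$

with the standardized version given for any $\alpha = 1, \ldots, d$,

(2.3)  $B_{J,\alpha}(x_\alpha) = \frac{b_{J,\alpha}(x_\alpha)}{\|b_{J,\alpha}\|_2}, \qquad \forall J = 1, \ldots, N.$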

Define next the $(1 + dN)$-dimensional space $G = G[0, 1]$ of additive spline
functions as the linear space spanned by
$\{1, B_{J,\alpha}(x_\alpha), \alpha = 1, \ldots, d, J = 1, \ldots, N\}$, and
denote by $G_n \subset R^n$ the linear space spanned by
$\{1, \{B_{J,\alpha}(X_{i\alpha})\}_{i=1}^n, \alpha = 1, \ldots, d, J = 1, \ldots, N\}$.
As $n \to \infty$, the dimension of $G_n$ becomes $1 + dN$ with probability
approaching 1. The spline estimator of the additive function $m(\mathbf{x})$
is the unique element $\hat m(\mathbf{x}) = \hat m_n(\mathbf{x})$ from the
space $G$ so that the vector
$\{\hat m(\mathbf{X}_1), \ldots, \hat m(\mathbf{X}_n)\}^T$ best approximates
the response vector $\mathbf{Y}$. To be precise, we define

(2.4)  $\hat m(\mathbf{x}) = \hat\lambda'_0 + \sum_{\alpha=1}^{d} \sum_{J=1}^{N} \hat\lambda'_{J,\alpha} I_{J,\alpha}(x_\alpha),$

where the coefficients
$(\hat\lambda'_0, \hat\lambda'_{1,1}, \ldots, \hat\lambda'_{N,d})$ are
solutions of the least squares problem

$\{\hat\lambda'_0, \hat\lambda'_{1,1}, \ldots, \hat\lambda'_{N,d}\}^T = \arg\min_{R^{dN+1}} \sum_{i=1}^{n} \Big\{ Y_i - \lambda_0 - \sum_{\alpha=1}^{d} \sum_{J=1}^{N} \lambda_{J,\alpha} I_{J,\alpha}(X_{i\alpha}) \Big\}^2.$

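Simple linear algebra shows that

(2.5)  $\hat m(\mathbf{x}) = \hat\lambda_0 + \sum_{\alpha=1}^{d} \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha),$

where $(\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d})$ are
solutions of the least squares problem

(2.6)  $\{\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d}\}^T = \arg\min_{R^{dN+1}} \sum_{i=1}^{n} \Big\{ Y_i - \lambda_0 - \sum_{\alpha=1}^{d} \sum_{J=1}^{N} \lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha}) \Big\}^2;$

while (2.4) is used for data-analytic implementation, the mathematically
equivalent expression (2.5) is convenient for asymptotic analysis.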

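The pilot estimators of each component function and the constant are

(2.7)  $\hat m_\alpha(x_\alpha) = \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha) - n^{-1} \sum_{i=1}^{n} \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha}),$

       $\hat m_c = \hat\lambda_0 + n^{-1} \sum_{\alpha=1}^{d} \sum_{i=1}^{n} \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha}).$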
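The pilot step can be sketched numerically. The following is a minimal Python sketch, not the authors' R or XploRe code: it fits the indicator form (2.4) by ordinary least squares and then centers each fitted component empirically, which should agree with (2.7) because the indicator and B spline bases span the same additive space. The function names `indicator_design` and `pilot_components`, the fixed choice N = 8 and the toy data are illustrative assumptions.

```python
import numpy as np

def indicator_design(X, N):
    """Design matrix [1, I_{J,alpha}(X_{i,alpha})], J = 1..N, alpha = 1..d.

    X is an (n, d) array with entries in [0, 1]; bin J = 0 acts as baseline.
    """
    n, d = X.shape
    H = 1.0 / (N + 1)
    bins = np.minimum((X / H).astype(int), N)   # bin index in 0..N
    cols = [np.ones(n)]
    for a in range(d):
        for J in range(1, N + 1):
            cols.append((bins[:, a] == J).astype(float))
    return np.column_stack(cols)

def pilot_components(X, Y, N):
    """Least squares fit as in (2.4), then empirical centering as in (2.7).

    Returns an (n, d) array of centered component values at the sample points.
    """
    n, d = X.shape
    D = indicator_design(X, N)
    lam, *_ = np.linalg.lstsq(D, Y, rcond=None)
    comps = np.empty((n, d))
    for a in range(d):
        block = slice(1 + a * N, 1 + (a + 1) * N)   # columns of covariate a
        raw = D[:, block] @ lam[block]
        comps[:, a] = raw - raw.mean()              # empirical centering
    return comps

# Toy additive data (assumed for illustration only).
rng = np.random.default_rng(0)
n, d = 500, 2
X = rng.uniform(size=(n, d))
m1 = np.sin(2 * np.pi * X[:, 0])
m2 = X[:, 1] - 0.5
Y = 1.0 + m1 + m2 + 0.1 * rng.standard_normal(n)
comps = pilot_components(X, Y, N=8)
```

Each column of `comps` has empirical mean zero by construction, mirroring the centering in (2.7).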

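These pilot estimators are then used to define new pseudo-responses
$\hat Y_{i1}$, which are estimates of the unobservable "oracle" responses
$Y_{i1}$. Specifically,

(2.8)  $\hat Y_{i1} = Y_i - \hat c - \sum_{\alpha=2}^{d} \hat m_\alpha(X_{i\alpha}), \qquad Y_{i1} = Y_i - c - \sum_{\alpha=2}^{d} m_\alpha(X_{i\alpha}),$

where $\hat c = \bar Y_n = n^{-1} \sum_{i=1}^{n} Y_i$, which is a
$\sqrt{n}$-consistent estimator of $c$ by the central limit theorem. Next, we
define the spline-backfitted kernel estimator of $m_1(x_1)$ as
$\hat m^*_1(x_1)$ based on $\{\hat Y_{i1}, X_{i1}\}_{i=1}^n$, which attempts to
mimic the would-be Nadaraya-Watson estimator $\tilde m^*_1(x_1)$ of $m_1(x_1)$
based on $\{Y_{i1}, X_{i1}\}_{i=1}^n$ if the unobservable "oracle" responses
$\{Y_{i1}\}_{i=1}^n$ were available:

(2.9)  $\hat m^*_1(x_1) = \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1) \hat Y_{i1}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)}, \qquad \tilde m^*_1(x_1) = \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1) Y_{i1}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)},$

where $\hat Y_{i1}$ and $Y_{i1}$ are defined in (2.8).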
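The second stage is an ordinary Nadaraya-Watson smoother applied to the (pseudo-)responses. Below is a minimal sketch, assuming the usual scaling $K_h(u) = K(u/h)/h$ and the quartic kernel mentioned in Section 6; the names `quartic_kernel` and `nadaraya_watson`, the bandwidth constant 0.5 and the toy data are illustrative assumptions. The demo smooths the "oracle" responses; the SPBK estimator would pass the pseudo-responses built from the pilot fit to the same function.

```python
import numpy as np

def quartic_kernel(u):
    """Quartic (biweight) kernel supported on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def nadaraya_watson(x_grid, X1, Y1, h):
    """Nadaraya-Watson estimate at each point of x_grid, in the form (2.9)."""
    x_grid = np.atleast_1d(np.asarray(x_grid, dtype=float))
    # Row g holds the weights K_h(X_{i1} - x_g), i = 1..n.
    W = quartic_kernel((X1[None, :] - x_grid[:, None]) / h) / h
    return (W @ Y1) / W.sum(axis=1)

# Toy data with a known second component (assumed for illustration only).
rng = np.random.default_rng(1)
n = 400
X = rng.uniform(size=(n, 2))
c, m2 = 1.0, X[:, 1] - 0.5
Y = c + np.sin(2 * np.pi * X[:, 0]) + m2 + 0.1 * rng.standard_normal(n)

Y_oracle = Y - c - m2                 # "oracle" responses, as in (2.8)
h = 0.5 * n ** (-1 / 5)               # bandwidth of order n^(-1/5)
grid = np.linspace(0.1, 0.9, 9)
fit = nadaraya_watson(grid, X[:, 0], Y_oracle, h)
```

Because the same weights appear in numerator and denominator, the smoother reproduces constant responses exactly, which is a convenient sanity check.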






Throughout this paper, on any fixed interval $[a, b]$, we denote the space
of second-order smooth functions as
$C^{(2)}[a, b] = \{m \mid m'' \in C[a, b]\}$ and the class of Lipschitz
continuous functions for any fixed constant $C > 0$ as
$\mathrm{Lip}([a, b], C) = \{m \mid |m(x) - m(x')| \le C|x - x'|, \forall x, x' \in [a, b]\}$.

Before presenting the main results, we state the following assumptions. 

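(A1) The additive component function $m_1(x_1) \in C^{(2)}[0, 1]$, while
     there is a constant $0 < C_\infty < \infty$ such that
     $m_\beta \in \mathrm{Lip}([0, 1], C_\infty)$, $\forall \beta = 2, \ldots, d$.

(A2) There exist positive constants $K_0$ and $\lambda_0$ such that
     $\alpha(n) \le K_0 e^{-\lambda_0 n}$ holds for all $n$, with the
     $\alpha$-mixing coefficients for
     $\{\mathbf{Z}_i = (\mathbf{X}_i^T, \varepsilon_i)\}_{i=1}^n$ defined as

     (2.10)  $\alpha(k) = \sup_{B \in \sigma\{\mathbf{Z}_s, s \le t\},\, C \in \sigma\{\mathbf{Z}_s, s \ge t+k\}} |P(B \cap C) - P(B)P(C)|, \qquad k \ge 1.$

(A3) The noise $\varepsilon_i$ satisfies $E(\varepsilon_i | \mathbf{X}_i) = 0$,
     $E(\varepsilon_i^2 | \mathbf{X}_i) = 1$ and
     $E(|\varepsilon_i|^{2+\delta} | \mathbf{X}_i) < M_\delta$ for some
     $\delta > 1/2$ and a finite positive $M_\delta$, and $\sigma(\mathbf{x})$
     is continuous on $[0, 1]^d$:

     $0 < c_\sigma \le \inf_{\mathbf{x} \in [0,1]^d} \sigma(\mathbf{x}) \le \sup_{\mathbf{x} \in [0,1]^d} \sigma(\mathbf{x}) \le C_\sigma < \infty.$

(A4) The density function $f(\mathbf{x})$ of $\mathbf{X}$ is continuous and

     $0 < c_f \le \inf_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}) \le \sup_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}) \le C_f < \infty.$

     The marginal densities $f_\alpha(x_\alpha)$ of $X_\alpha$ have continuous
     derivatives on $[0, 1]$ as well as the uniform upper bound $C_f$ and
     lower bound $c_f$.

(A5) The kernel function $K \in \mathrm{Lip}([-1, 1], C_K)$ for some constant
     $C_K > 0$, and is bounded, nonnegative, symmetric and supported on
     $[-1, 1]$. The bandwidth $h \sim n^{-1/5}$, that is,
     $c_h n^{-1/5} \le h \le C_h n^{-1/5}$ for some positive constants
     $C_h$, $c_h$.

(A6) The number of interior knots $N \sim n^{2/5} \log n$, that is,
     $c_N n^{2/5} \log n \le N \le C_N n^{2/5} \log n$ for some positive
     constants $c_N$, $C_N$.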

Remark 2.1. The smoothness assumption of the true component functions is
greatly relaxed in our paper and we believe that our assumption (A1) is close
to being minimal. By the result of [20], a geometrically ergodic time series
is a strongly mixing sequence. Therefore, assumption (A2) is suitable for
(1.1) as a time series model under the aforementioned assumptions.
Assumptions (A3)-(A5) are typical in the nonparametric smoothing literature;
see, for instance, [5]. For (A6), the proof of Theorem 2.1 in the Appendix
will make it clear that the number of knots can be of the more general form
$N \sim n^{2/5} N'$, where the sequence $N'$ satisfies $N' \to \infty$,
$n^{-\theta} N' \to 0$ for any $\theta > 0$. There is no optimal way to choose
$N'$ as in the literature. Here we select $N$ to be of barely larger order
than $n^{2/5}$.
The asymptotic property of the kernel smoother $\tilde m^*_1(x_1)$ is well
developed. Under assumptions (A1)-(A5), it is straightforward to verify (as
in [1]) that

$\sup_{x_1 \in [h, 1-h]} |\tilde m^*_1(x_1) - m_1(x_1)| = o_p(n^{-2/5} \log n),$

$\sqrt{nh} \{\tilde m^*_1(x_1) - m_1(x_1) - b_1(x_1) h^2\} \xrightarrow{D} N\{0, v_1^2(x_1)\},$

where

(2.11)  $v_1^2(x_1) = \int K^2(u)\,du \; E[\sigma^2(X_1, \ldots, X_d) \mid X_1 = x_1] \, f_1^{-1}(x_1).$

The following theorem states that the asymptotic uniform magnitude of the
difference between $\hat m^*_1(x_1)$ and $\tilde m^*_1(x_1)$ is of order
$o_p(n^{-2/5})$, which is dominated by the asymptotic uniform size of
$\tilde m^*_1(x_1) - m_1(x_1)$. As a result, $\hat m^*_1(x_1)$ will have the
same asymptotic distribution as $\tilde m^*_1(x_1)$.
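For orientation, the leading bias coefficient of a Nadaraya-Watson smoother under random design is usually written in the textbook form below. This is offered as a reminder of the standard result under the smoothness of (A1) and (A4), and as an assumption about what the bias coefficient $b_1(x_1)$ in the limit above denotes, not as a quotation of (2.11).

```latex
% Textbook Nadaraya-Watson bias coefficient under random design (assumed form).
b_1(x_1) = \int u^2 K(u)\,du \;
  \frac{ m_1''(x_1)\, f_1(x_1)/2 + m_1'(x_1)\, f_1'(x_1) }{ f_1(x_1) }
```

The second term in the numerator vanishes when the marginal density $f_1$ is flat, which is why the bias simplifies for uniformly distributed predictors.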

Theorem 2.1. Under assumptions (A1)-(A6), the SPBK estimator
$\hat m^*_1(x_1)$ given in (2.9) satisfies

$\sup_{x_1 \in [0,1]} |\hat m^*_1(x_1) - \tilde m^*_1(x_1)| = o_p(n^{-2/5}).$

Hence with $b_1(x_1)$ and $v_1^2(x_1)$ as defined in (2.11), for any
$x_1 \in [h, 1 - h]$,

$\sqrt{nh} \{\hat m^*_1(x_1) - m_1(x_1) - b_1(x_1) h^2\} \xrightarrow{D} N\{0, v_1^2(x_1)\}.$

Remark 2.2. Theorem 2.1 holds for $\hat m^*_\alpha(x_\alpha)$ similarly
constructed as $\hat m^*_1(x_1)$, for any $\alpha = 2, \ldots, d$, that is,

(2.12)  $\hat m^*_\alpha(x_\alpha) = \frac{\sum_{i=1}^{n} K_h(X_{i\alpha} - x_\alpha) \hat Y_{i\alpha}}{\sum_{i=1}^{n} K_h(X_{i\alpha} - x_\alpha)}, \qquad \hat Y_{i\alpha} = Y_i - \hat c - \sum_{1 \le \beta \le d,\, \beta \ne \alpha} \hat m_\beta(X_{i\beta}),$

where $\hat m_\beta(X_{i\beta})$, $\beta = 1, \ldots, d$, are the pilot
estimators of each component function given in (2.7). Similar constructions
can be based on a local polynomial instead of the Nadaraya-Watson estimator.
For more on the properties of local polynomial estimators, in particular,
their minimax efficiency, see [5].

Remark 2.3. Compared to the SBE in [18], the variance term $v_1(x_1)$ is
identical to that of SBE and the bias term $b_1(x_1)$ is much more explicit
than that of SBE, at least when the Nadaraya-Watson smoother is used.
Theorem 2.1 can be used to construct asymptotic confidence intervals. Under
assumptions (A1)-(A6), for any $\alpha \in (0, 1)$, an asymptotic
$100(1 - \alpha)\%$ pointwise confidence interval for $m_1(x_1)$ is

(2.13)  $\hat m^*_1(x_1) - b_1(x_1) h^2 \pm z_{\alpha/2}\, \hat\sigma_1(x_1) \Big\{ \int K^2(u)\,du \Big\}^{1/2} \Big/ \{ n h \hat f_1(x_1) \}^{1/2},$

where $\hat\sigma_1^2(x_1)$ and $\hat f_1(x_1)$ are estimators of
$E[\sigma^2(X_1, \ldots, X_d) \mid X_1 = x_1]$ and $f_1(x_1)$.
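The interval (2.13) is simple to evaluate once the plug-in quantities are available. The sketch below is an illustration under stated assumptions: the function name `pointwise_ci` is hypothetical, the estimates of the conditional variance and the marginal density are supplied by the caller, the bias term defaults to zero (as one might do under undersmoothing), and the constant 5/7 is the integral of $K^2$ for the quartic kernel.

```python
import math
from statistics import NormalDist

R_QUARTIC = 5.0 / 7.0   # integral of K(u)^2 for the quartic kernel

def pointwise_ci(m_hat, sigma2_hat, f_hat, n, h, level=0.95,
                 bias_hat=0.0, roughness=R_QUARTIC):
    """Interval of the form (2.13) at a single point x_1.

    m_hat      : SPBK estimate at x_1
    sigma2_hat : estimate of E[sigma^2(X) | X_1 = x_1]
    f_hat      : estimate of the marginal density f_1(x_1)
    bias_hat   : estimate of b_1(x_1); zero ignores the bias correction
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * math.sqrt(sigma2_hat * roughness / (n * h * f_hat))
    center = m_hat - bias_hat * h**2
    return center - half, center + half

# Illustrative inputs only.
lo, hi = pointwise_ci(m_hat=0.8, sigma2_hat=0.25, f_hat=1.0, n=500, h=0.3)
```

The half-width shrinks at the rate $(nh)^{-1/2}$, matching the normalization in Theorem 2.1.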

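The following corollary provides the asymptotic distribution of
$\hat m^*(\mathbf{x})$. The proof of this corollary is straightforward and
therefore omitted.

Corollary 2.1. Under assumptions (A1)-(A6) and the additional assumption
that $m_\alpha(x_\alpha) \in C^{(2)}[0, 1]$, $\alpha = 2, \ldots, d$, for any
$\mathbf{x} \in [0, 1]^d$, the SPBK estimators $\hat m^*_\alpha(x_\alpha)$,
$\alpha = 1, \ldots, d$, are defined as given in (2.12). Let

$\hat m^*(\mathbf{x}) = \hat c + \sum_{\alpha=1}^{d} \hat m^*_\alpha(x_\alpha), \qquad b(\mathbf{x}) = \sum_{\alpha=1}^{d} b_\alpha(x_\alpha), \qquad v^2(\mathbf{x}) = \sum_{\alpha=1}^{d} v_\alpha^2(x_\alpha).$

Then

$\sqrt{nh} \{\hat m^*(\mathbf{x}) - m(\mathbf{x}) - b(\mathbf{x}) h^2\} \xrightarrow{D} N\{0, v^2(\mathbf{x})\}.$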

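3. Decomposition. In this section we introduce some additional notation to
shed some light on the ideas behind the proof of Theorem 2.1. For any
functions $\phi$, $\varphi$ on $[0, 1]^d$, define the empirical inner product
and the empirical norm as

$\langle \phi, \varphi \rangle_{2,n} = n^{-1} \sum_{i=1}^{n} \phi(\mathbf{X}_i) \varphi(\mathbf{X}_i), \qquad \|\phi\|_{2,n}^2 = n^{-1} \sum_{i=1}^{n} \phi^2(\mathbf{X}_i).$

In addition, if the functions $\phi$, $\varphi$ are $L_2$-integrable, define
the theoretical inner product and its corresponding theoretical $L_2$ norm as

$\langle \phi, \varphi \rangle_2 = E\{\phi(\mathbf{X}_i) \varphi(\mathbf{X}_i)\}, \qquad \|\phi\|_2^2 = E\{\phi^2(\mathbf{X}_i)\}.$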

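The evaluation of the spline estimator $\hat m(\mathbf{x})$ at the $n$
observations results in an $n$-dimensional vector,
$\hat m(\mathbf{X}_1, \ldots, \mathbf{X}_n) = \{\hat m(\mathbf{X}_1), \ldots, \hat m(\mathbf{X}_n)\}^T$,
which can be considered as the projection of $\mathbf{Y}$ on the space $G_n$
with respect to the empirical inner product
$\langle \cdot, \cdot \rangle_{2,n}$. In general, for any $n$-dimensional
vector $\Lambda = \{\Lambda_1, \ldots, \Lambda_n\}^T$, we define
$P_n \Lambda(\mathbf{x})$ as the spline function constructed from the
projection of $\Lambda$ on the inner product space
$(G_n, \langle \cdot, \cdot \rangle_{2,n})$, that is,

$P_n \Lambda(\mathbf{x}) = \hat\lambda_0 + \sum_{\alpha=1}^{d} \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha),$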



ADDITIVE AUTOREGRESSION MODEL 




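with the coefficients
$(\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d})$ given in
(2.6). Next, the multivariate function $P_n \Lambda(\mathbf{x})$ is
decomposed into the empirically centered additive components
$P_{n,\alpha} \Lambda(x_\alpha)$, $\alpha = 1, \ldots, d$, and the constant
component $P_{n,c} \Lambda$:

(3.1)  $P_{n,\alpha} \Lambda(x_\alpha) = P^*_{n,\alpha} \Lambda(x_\alpha) - n^{-1} \sum_{i=1}^{n} P^*_{n,\alpha} \Lambda(X_{i\alpha}),$

(3.2)  $P_{n,c} \Lambda = \hat\lambda_0 + n^{-1} \sum_{\alpha=1}^{d} \sum_{i=1}^{n} P^*_{n,\alpha} \Lambda(X_{i\alpha}),$

where $P^*_{n,\alpha} \Lambda(x_\alpha) = \sum_{J=1}^{N} \hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha)$.
With this new notation, we can rewrite the spline estimators
$\hat m(\mathbf{x})$, $\hat m_\alpha(x_\alpha)$, $\hat m_c$ defined in (2.5)
and (2.7) as

$\hat m(\mathbf{x}) = P_n \mathbf{Y}(\mathbf{x}), \qquad \hat m_\alpha(x_\alpha) = P_{n,\alpha} \mathbf{Y}(x_\alpha), \qquad \hat m_c = P_{n,c} \mathbf{Y}.$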

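Based on the relation
$Y_i = m(\mathbf{X}_i) + \sigma(\mathbf{X}_i) \varepsilon_i$, one defines
similarly the noiseless spline smoothers and the variance spline components,

(3.3)  $\tilde m(\mathbf{x}) = P_n\{m(\mathbf{X})\}(\mathbf{x}), \qquad \tilde m_\alpha(x_\alpha) = P_{n,\alpha}\{m(\mathbf{X})\}(x_\alpha), \qquad \tilde m_c = P_{n,c}\{m(\mathbf{X})\},$

(3.4)  $\tilde\varepsilon(\mathbf{x}) = P_n \mathbf{E}(\mathbf{x}), \qquad \tilde\varepsilon_\alpha(x_\alpha) = P_{n,\alpha} \mathbf{E}(x_\alpha), \qquad \tilde\varepsilon_c = P_{n,c} \mathbf{E},$

where the noise vector
$\mathbf{E} = \{\sigma(\mathbf{X}_i) \varepsilon_i\}_{i=1}^n$. Due to the
linearity of the operators $P_n$, $P_{n,c}$, $P_{n,\alpha}$,
$\alpha = 1, \ldots, d$, one has the crucial decomposition

(3.5)  $\hat m(\mathbf{x}) = \tilde m(\mathbf{x}) + \tilde\varepsilon(\mathbf{x}), \qquad \hat m_c = \tilde m_c + \tilde\varepsilon_c, \qquad \hat m_\alpha(x_\alpha) = \tilde m_\alpha(x_\alpha) + \tilde\varepsilon_\alpha(x_\alpha),$

for $\alpha = 1, \ldots, d$. As closer examination is needed later for
$\tilde\varepsilon(\mathbf{x})$ and $\tilde\varepsilon_\alpha(x_\alpha)$, we
define in addition
$\tilde{\mathbf{a}} = \{\tilde a_0, \tilde a_{1,1}, \ldots, \tilde a_{N,d}\}^T$
as the minimizer of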



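(3.6)  $\sum_{i=1}^{n} \Big\{ \sigma(\mathbf{X}_i) \varepsilon_i - a_0 - \sum_{\alpha=1}^{d} \sum_{J=1}^{N} a_{J,\alpha} B_{J,\alpha}(X_{i\alpha}) \Big\}^2.$

Then $\tilde\varepsilon(\mathbf{x}) = \tilde{\mathbf{a}}^T \mathbf{B}(\mathbf{x})$,
where the vector $\mathbf{B}(\mathbf{x})$ and matrix $\mathbf{B}$ are defined
as

(3.7)  $\mathbf{B}(\mathbf{x}) = \{1, B_{1,1}(x_1), \ldots, B_{N,d}(x_d)\}^T, \qquad \mathbf{B} = \{\mathbf{B}(\mathbf{X}_1), \ldots, \mathbf{B}(\mathbf{X}_n)\}^T.$

Thus $\tilde{\mathbf{a}} = (\mathbf{B}^T \mathbf{B})^{-1} \mathbf{B}^T \mathbf{E}$
is the solution of (3.6) and specifically $\tilde{\mathbf{a}}$ is equal to

(3.8)  $\begin{pmatrix} 1 & \mathbf{0}_{dN}^T \\ \mathbf{0}_{dN} & (\langle B_{J,\alpha}, B_{J',\alpha'} \rangle_{2,n})_{1 \le \alpha, \alpha' \le d,\, 1 \le J, J' \le N} \end{pmatrix}^{-1} \times \begin{pmatrix} n^{-1} \sum_{i=1}^{n} \sigma(\mathbf{X}_i) \varepsilon_i \\ \big( n^{-1} \sum_{i=1}^{n} B_{J,\alpha}(X_{i\alpha}) \sigma(\mathbf{X}_i) \varepsilon_i \big)_{1 \le J \le N,\, 1 \le \alpha \le d} \end{pmatrix},$

where $\mathbf{0}_p$ is a $p$-vector with all elements 0.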
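The decomposition (3.5) rests only on the linearity of least squares projection, which is easy to confirm numerically. The sketch below uses a random stand-in design matrix instead of the spline basis (an assumption made purely to keep the example short); `coef` is a hypothetical helper returning least squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 7

# Stand-in design with an intercept column; any full-rank basis matrix works.
B = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
signal = rng.standard_normal(n)          # plays the role of m(X_i)
noise = 0.5 * rng.standard_normal(n)     # plays the role of sigma(X_i) * eps_i

def coef(v):
    """Least squares coefficients of v projected on the columns of B."""
    return np.linalg.lstsq(B, v, rcond=None)[0]

# Projection of the response splits into signal part plus noise part, as in (3.5).
lhs = coef(signal + noise)
rhs = coef(signal) + coef(noise)

# The noise coefficients agree with the normal-equation form (B'B)^{-1} B'E.
a_tilde = coef(noise)
a_normal = np.linalg.solve(B.T @ B, B.T @ noise)
```

Both comparisons hold up to floating point error, which is all that the "signal plus noise" split of the pilot estimator requires.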

Our main objective is to study the difference between the smoothed
backfitted estimator $\hat m^*_1(x_1)$ and the smoothed "oracle" estimator
$\tilde m^*_1(x_1)$, both given in (2.9). From now on, we assume without loss
of generality that $d = 2$ for notational brevity. Making use of the
definition of $\hat c$ and the signal and noise decomposition (3.5), the
difference $\tilde m^*_1(x_1) - \hat m^*_1(x_1) - \hat c + c$ can be treated
as the sum of two terms,

(3.9)  $\frac{n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \{\hat m_2(X_{i2}) - m_2(X_{i2})\}}{n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1)} = \frac{\Psi_b(x_1) + \Psi_v(x_1)}{n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1)},$

where

(3.10)  $\Psi_b(x_1) = n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \{\tilde m_2(X_{i2}) - m_2(X_{i2})\},$

(3.11)  $\Psi_v(x_1) = n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \tilde\varepsilon_2(X_{i2}).$

The term $\Psi_b(x_1)$ is induced by the bias term
$\tilde m_2(X_{i2}) - m_2(X_{i2})$, while $\Psi_v(x_1)$ is related to the
variance term $\tilde\varepsilon_2(X_{i2})$. Both of these two terms have
order $o_p(n^{-2/5})$ by Propositions 4.1 and 5.1 in the next two sections.
Standard theory of kernel density estimation ensures that the denominator in
(3.9), $n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1)$, has a positive lower bound
for $x_1 \in [0, 1]$. The additional nuisance term $\hat c - c$ is clearly of
order $O_p(n^{-1/2})$ and thus $o_p(n^{-2/5})$, which needs no further
arguments for the proofs. Theorem 2.1 then follows from Propositions 4.1 and
5.1.

4. Bias reduction for $\Psi_b(x_1)$. In this section we show that the bias
term $\Psi_b(x_1)$ of (3.10) is uniformly of order $o_p(n^{-2/5})$ for
$x_1 \in [0, 1]$.

Proposition 4.1. Under assumptions (A1), (A2) and (A4)-(A6),

$\sup_{x_1 \in [0,1]} |\Psi_b(x_1)| = O_p(n^{-1/2} + H) = o_p(n^{-2/5}).$

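Lemma 4.1. Under assumption (A1), there exist functions $g_1, g_2 \in G$,
such that

$\Big\| \tilde m - g + \sum_{\alpha=1}^{2} \langle 1, g_\alpha(X_\alpha) \rangle_{2,n} \Big\|_{2,n} = O_p(n^{-1/2} + H),$

where $g(\mathbf{x}) = c + \sum_{\alpha=1}^{2} g_\alpha(x_\alpha)$ and
$\tilde m$ is defined in (3.3).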
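Proof. According to the result on page 149 of [3], there is a constant
$C_\infty > 0$ such that for the function $g_\alpha \in G$,
$\|g_\alpha - m_\alpha\|_\infty \le C_\infty H$, $\alpha = 1, 2$. Thus
$\|g - m\|_\infty \le \sum_{\alpha=1}^{2} \|g_\alpha - m_\alpha\|_\infty \le 2 C_\infty H$
and $\|\tilde m - m\|_{2,n} \le \|g - m\|_{2,n} \le 2 C_\infty H$. Noting that
$\|\tilde m - g\|_{2,n} \le \|\tilde m - m\|_{2,n} + \|g - m\|_{2,n} \le 4 C_\infty H$,
one has

(4.1)  $|\langle g_\alpha(X_\alpha), 1 \rangle_{2,n}| \le |\langle 1, g_\alpha(X_\alpha) \rangle_{2,n} - \langle 1, m_\alpha(X_\alpha) \rangle_{2,n}| + |\langle 1, m_\alpha(X_\alpha) \rangle_{2,n}| \le C_\infty H + O_p(n^{-1/2}).$

Therefore

$\Big\| \tilde m - g + \sum_{\alpha=1}^{2} \langle 1, g_\alpha(X_\alpha) \rangle_{2,n} \Big\|_{2,n} \le \|\tilde m - g\|_{2,n} + \sum_{\alpha=1}^{2} |\langle 1, g_\alpha(X_\alpha) \rangle_{2,n}| \le 6 C_\infty H + O_p(n^{-1/2}) = O_p(n^{-1/2} + H).$  □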

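Proof of Proposition 4.1. Denote

$R_1 = \sup_{x_1 \in [0,1]} \left| \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1) \{g_2(X_{i2}) - m_2(X_{i2})\}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)} \right|,$

$R_2 = \sup_{x_1 \in [0,1]} \left| \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1) \{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2) \rangle_{2,n}\}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)} \right|;$

then $\sup_{x_1 \in [0,1]} |\Psi_b(x_1)| \le |\langle 1, g_2(X_2) \rangle_{2,n}| + R_1 + R_2$.
For $R_1$, using the result on page 149 of [3], one has
$R_1 \le C_\infty H$. To deal with $R_2$, let
$B^*_{J,\alpha}(x_\alpha) = B_{J,\alpha}(x_\alpha) - \langle 1, B_{J,\alpha}(X_\alpha) \rangle_{2,n}$,
for $J = 1, \ldots, N$, $\alpha = 1, 2$; then one can write

$\tilde m(\mathbf{x}) - g(\mathbf{x}) + \sum_{\alpha=1}^{2} \langle 1, g_\alpha(X_\alpha) \rangle_{2,n} = \tilde a^* + \sum_{\alpha=1}^{2} \sum_{J=1}^{N} \tilde a^*_{J,\alpha} B^*_{J,\alpha}(x_\alpha).$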



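Thus,
$n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2) \rangle_{2,n}\}$
can be rewritten as
$n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \sum_{J=1}^{N} \tilde a^*_{J,2} B^*_{J,2}(X_{i2})$,
bounded by

$\sum_{J=1}^{N} |\tilde a^*_{J,2}| \sup_{1 \le J \le N} \Big| n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) B^*_{J,2}(X_{i2}) \Big| \le \sum_{J=1}^{N} |\tilde a^*_{J,2}| \Big\{ \sup_{1 \le J \le N} \Big| n^{-1} \sum_{i=1}^{n} \omega_J(\mathbf{X}_i, x_1) \Big| + A_{n,1} \cdot n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \Big\},$

where
$A_{n,1} = \sup_{J,\alpha} |\langle 1, B_{J,\alpha} \rangle_{2,n} - \langle 1, B_{J,\alpha} \rangle_2| = O_p(n^{-1/2} \log n)$
as in (A.12) and $\omega_J(\mathbf{X}_i, x_1)$ is in (5.5) with mean
$\mu_{\omega_J}(x_1)$. By Lemma A.3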



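$\sup_{x_1 \in [0,1]} \sup_{1 \le J \le N} \Big| n^{-1} \sum_{i=1}^{n} \omega_J(\mathbf{X}_i, x_1) \Big| \le \sup_{x_1 \in [0,1]} \sup_{1 \le J \le N} \Big| n^{-1} \sum_{i=1}^{n} \omega_J(\mathbf{X}_i, x_1) - \mu_{\omega_J}(x_1) \Big| + \sup_{x_1 \in [0,1]} \sup_{1 \le J \le N} |\mu_{\omega_J}(x_1)|$

$= O_p(\log n / \sqrt{nh}) + O_p(H^{1/2}) = O_p(H^{1/2}).$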

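Therefore, one has

$\sup_{x_1 \in [0,1]} \Big| n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) \{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2) \rangle_{2,n}\} \Big|$

$\le \Big\{ N \sum_{J=1}^{N} (\tilde a^*_{J,2})^2 \Big\}^{1/2} \Big\{ O_p(H^{1/2}) + O_p\Big( \frac{\log n}{\sqrt{n}} \Big) \Big\}$

$= O_p\Big( \Big\| \tilde m - g + \sum_{\alpha=1}^{2} \langle 1, g_\alpha(X_\alpha) \rangle_{2,n} \Big\|_2 \Big) = O_p\Big( \Big\| \tilde m - g + \sum_{\alpha=1}^{2} \langle 1, g_\alpha(X_\alpha) \rangle_{2,n} \Big\|_{2,n} \Big),$

where the last step follows from Lemma A.8. Thus, by Lemma 4.1,

(4.2)  $R_2 = O_p(n^{-1/2} + H).$

Combining (4.1) and (4.2), one establishes Proposition 4.1. □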

5. Variance reduction for $\Psi_v(x_1)$. In this section we will see that the
term $\Psi_v(x_1)$ given in (3.11) is uniformly of order $o_p(n^{-2/5})$. This
is the most challenging part to be proved, mostly done in the Appendix.
Define an auxiliary entity

(5.1)  $\tilde\varepsilon^*_2(x_2) = \sum_{J=1}^{N} \tilde a_{J,2} B_{J,2}(x_2),$

where $\tilde a_{J,2}$ is given in (3.8). Definitions (3.1) and (3.2) imply
that $\tilde\varepsilon_2(x_2)$ defined in (3.4) is simply the empirical
centering of $\tilde\varepsilon^*_2(x_2)$, that is,

(5.2)  $\tilde\varepsilon_2(x_2) = \tilde\varepsilon^*_2(x_2) - n^{-1} \sum_{i=1}^{n} \tilde\varepsilon^*_2(X_{i2}).$

Proposition 5.1. Under assumptions (A2)-(A6), one has

$\sup_{x_1 \in [0,1]} |\Psi_v(x_1)| = O_p(H) = o_p(n^{-2/5}).$
According to (5.2), we can write
$\Psi_v(x_1) = \Psi_v^{(2)}(x_1) - \Psi_v^{(1)}(x_1)$, where

(5.3)  $\Psi_v^{(1)}(x_1) = n^{-1} \sum_{l=1}^{n} K_h(X_{l1} - x_1) \cdot n^{-1} \sum_{i=1}^{n} \tilde\varepsilon^*_2(X_{i2}),$

(5.4)  $\Psi_v^{(2)}(x_1) = n^{-1} \sum_{l=1}^{n} K_h(X_{l1} - x_1) \tilde\varepsilon^*_2(X_{l2}),$

in which $\tilde\varepsilon^*_2(X_{i2})$ is given in (5.1). Further one
denotes

(5.5)  $\omega_J(\mathbf{X}_l, x_1) = K_h(X_{l1} - x_1) B_{J,2}(X_{l2}), \qquad \mu_{\omega_J}(x_1) = E \omega_J(\mathbf{X}_l, x_1).$

By (3.8) and (5.1), $\Psi_v^{(2)}(x_1)$ can be rewritten as

(5.6)  $\Psi_v^{(2)}(x_1) = n^{-1} \sum_{l=1}^{n} \sum_{J=1}^{N} \tilde a_{J,2}\, \omega_J(\mathbf{X}_l, x_1).$

The uniform order of $\Psi_v^{(1)}(x_1)$ and $\Psi_v^{(2)}(x_1)$ is given in
the next two lemmas.



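Lemma 5.1. Under assumptions (A2)-(A6), $\Psi_v^{(1)}(x_1)$ in (5.3)
satisfies

$\sup_{x_1 \in [0,1]} |\Psi_v^{(1)}(x_1)| = O_p\{N (\log n)^2 / n\}.$

Proof. Based on (5.1),

$\Big| n^{-1} \sum_{i=1}^{n} \tilde\varepsilon^*_2(X_{i2}) \Big| \le \sum_{J=1}^{N} |\tilde a_{J,2}| \cdot \sup_{1 \le J \le N} \Big| n^{-1} \sum_{i=1}^{n} B_{J,2}(X_{i2}) \Big|.$

Lemma A.6 implies that

$\sum_{J=1}^{N} |\tilde a_{J,2}| \le \Big\{ N \sum_{J=1}^{N} \tilde a_{J,2}^2 \Big\}^{1/2} \le \{N \tilde{\mathbf{a}}^T \tilde{\mathbf{a}}\}^{1/2} = O_p(N n^{-1/2} \log n).$

By (A.12),
$\sup_{1 \le J \le N} | n^{-1} \sum_{i=1}^{n} B_{J,2}(X_{i2}) | \le A_{n,1} = O_p(n^{-1/2} \log n)$,
so

(5.7)  $\Big| n^{-1} \sum_{i=1}^{n} \tilde\varepsilon^*_2(X_{i2}) \Big| = O_p\{N (\log n)^2 / n\}.$

By assumption (A5) on the kernel function $K$, standard theory on kernel
density estimation entails that
$\sup_{x_1 \in [0,1]} | n^{-1} \sum_{i=1}^{n} K_h(X_{i1} - x_1) | = O_p(1)$.
Thus with (5.7) the lemma follows immediately. □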

Lemma 5.2. Under assumptions (A2)-(A6), $\Psi_v^{(2)}(x_1)$ in (5.4)
satisfies

$\sup_{x_1 \in [0,1]} |\Psi_v^{(2)}(x_1)| = O_p(H).$

Lemma 5.2 follows from Lemmas A.10 and A.11. Proposition 5.1 follows from
Lemmas 5.1 and 5.2.
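As a quick check of how the two lemma rates combine, assume $N$ is of exact order $n^{2/5} \log n$ as in (A6). The orders in Lemmas 5.1 and 5.2 can then be compared directly:

```latex
% Rate arithmetic under N \asymp n^{2/5} \log n (assumption (A6)).
H = (N+1)^{-1} \asymp n^{-2/5} (\log n)^{-1} = o\big(n^{-2/5}\big),
\qquad
\frac{N (\log n)^{2}}{n} \asymp n^{-3/5} (\log n)^{3}
  = H \cdot O\big(n^{-1/5} (\log n)^{4}\big) = o(H).
```

So the $O_p(H)$ term of Lemma 5.2 dominates, and the extra logarithmic factor in $N$ is what pushes $H$ strictly below the kernel rate $n^{-2/5}$.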

6. Simulation example. In this section we carry out two simulation ex- 
periments to illustrate the finite-sample behavior of our SPBK estimators. 
The programming codes are available in both R 2.2.1 and XploRe. For in- 
formation on XploRe, see [8] or visit www.xplore-stat.de. 

The number of interior knots $N$ for the spline estimation as in (2.6) will
be determined by the sample size $n$ and a tuning constant $c$. To be
precise,

$N = \min([c n^{2/5} \log n] + 1, [(n/2 - 1) d^{-1}]),$

in which $[a]$ denotes the integer part of $a$. In our simulation study, we
have used $c = 0.5, 1.0$. As seen in Table 1, the choice of $c$ makes little
difference, so we recommend using $c = 0.5$ to save computation for massive
data sets. The additional constraint that $N \le (n/2 - 1) d^{-1}$ ensures
that the number of terms in the linear least squares problem (2.6),
$1 + dN$, is no greater than $n/2$, which is necessary when the sample size
$n$ is moderate and the dimension $d$ is high.
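The knot rule can be written as a two-line function. This is a minimal sketch of the formula as stated above; the function name `num_interior_knots` is an illustrative assumption, and `math.log` is the natural logarithm.

```python
import math

def num_interior_knots(n, d, c=0.5):
    """N = min([c n^(2/5) log n] + 1, [(n/2 - 1)/d]), [.] = integer part."""
    by_rate = int(c * n ** (2 / 5) * math.log(n)) + 1
    by_sample_size = int((n / 2 - 1) / d)
    return min(by_rate, by_sample_size)

# Settings resembling the two simulation examples.
N_small_c = num_interior_knots(100, 3, c=0.5)   # rate term is the smaller one
N_large_c = num_interior_knots(100, 3, c=1.0)   # sample-size cap binds
N_high_d = num_interior_knots(500, 30, c=0.5)   # cap binds for d = 30
```

With $d = 30$ and $n = 500$ the cap is active, so the least squares problem has $1 + dN = 241$ terms, below $n/2 = 250$.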

We have obtained for comparison both the SPBK estimator
$\hat m^*_\alpha(x_\alpha)$ and the "oracle" estimator
$\tilde m^*_\alpha(x_\alpha)$ by Nadaraya-Watson regression estimation using a
quartic kernel and the rule-of-thumb bandwidth.

We consider first the accuracy of the estimation, measured in terms of mean
average squared error. Then to see that the SPBK estimator
$\hat m^*_\alpha(x_\alpha)$ is as efficient as the "oracle smoother"
$\tilde m^*_\alpha(x_\alpha)$, we define the empirical relative efficiency of
$\hat m^*_\alpha(x_\alpha)$ with respect to $\tilde m^*_\alpha(x_\alpha)$ as

(6.1)  $\mathrm{eff}_\alpha = \left[ \frac{\sum_{i=1}^{n} \{\tilde m^*_\alpha(X_{i\alpha}) - m_\alpha(X_{i\alpha})\}^2}{\sum_{i=1}^{n} \{\hat m^*_\alpha(X_{i\alpha}) - m_\alpha(X_{i\alpha})\}^2} \right]^{1/2}.$

Theorem 2.1 indicates that $\mathrm{eff}_\alpha$ should be close to 1 for all
$\alpha = 1, \ldots, d$. Figure 2 provides the kernel density estimations of
the above empirical efficiencies to observe the convergence.
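The efficiency measure is a ratio of root sums of squared errors. The sketch below assumes the oracle error sits in the numerator and the SPBK error in the denominator, so that values below 1 mean the SPBK fit is less accurate than the oracle fit; the function name `relative_efficiency` and the toy inputs are illustrative.

```python
import numpy as np

def relative_efficiency(spbk_fit, oracle_fit, truth):
    """Empirical relative efficiency of the SPBK fit w.r.t. the oracle fit.

    All arguments are arrays of values at the sample points X_{i,alpha}.
    """
    oracle_sse = np.sum((np.asarray(oracle_fit) - truth) ** 2)
    spbk_sse = np.sum((np.asarray(spbk_fit) - truth) ** 2)
    return np.sqrt(oracle_sse / spbk_sse)

# Toy check: the SPBK errors are twice the oracle errors, so efficiency is 0.5.
truth = np.zeros(4)
eff_half = relative_efficiency(np.full(4, 0.2), np.full(4, 0.1), truth)
eff_one = relative_efficiency(np.full(4, 0.1), np.full(4, 0.1), truth)
```

In a Monte Carlo study this function would be called once per replication, and the resulting values summarized by a density estimate as in Figure 2.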

Example 6.1. A time series $\{Y_t\}_{t=-1996}^{n+3}$ is generated according
to the nonlinear additive autoregression model with sine functions given in
[2],

$Y_t = 1.5 \sin\Big( \frac{\pi}{2} Y_{t-2} \Big) - 1.0 \sin\Big( \frac{\pi}{2} Y_{t-3} \Big) + \sigma_0 \varepsilon_t, \qquad \sigma_0 = 0.5, 1.0,$

where $\{\varepsilon_t\}_{t=-1996}^{n+3}$ are i.i.d. standard normal errors.
Let $\mathbf{X}_t^T = \{Y_{t-1}, Y_{t-2}, Y_{t-3}\}$. Theorem 3 on page 91 of
[4] establishes that $\{Y_t, \mathbf{X}_t^T\}_{t=-1996}^{n+3}$ is
geometrically ergodic. The first 2000 observations are discarded to make
$\{Y_t\}_{t=1}^{n+3}$ behave like a geometrically $\alpha$-mixing and strictly
stationary time series. The multivariate datum
$\{Y_t, \mathbf{X}_t^T\}_{t=4}^{n+3}$ then satisfies assumptions (A1) to (A6)
except
Table 1
Report of Example 6.1

                      Component #1            Component #2            Component #3
sigma_0    n     c    1st stage  2nd stage    1st stage  2nd stage    1st stage  2nd stage
0.5      100   0.5    0.1231     0.0461       0.1476     0.0645       0.1254     0.0681
                1.0    0.1278     0.0520       0.1404     0.0690       0.1318     0.0726
         200   0.5    0.0539     0.0125       0.0616     0.0275       0.0577     0.0252
                1.0    0.0841     0.0144       0.0839     0.0290       0.0848     0.0285
         500   0.5    0.0263     0.0031       0.0306     0.0107       0.0278     0.0102
                1.0    0.0595     0.0044       0.0578     0.0115       0.0605     0.0119
        1000   0.5    0.0169     0.0015       0.0210     0.0053       0.0178     0.0054
                1.0    0.0364     0.0018       0.0367     0.0054       0.0375     0.0059
1.0      100   0.5    0.3008     0.0587       0.3298     0.1427       0.3236     0.1393
                1.0    0.3088     0.0586       0.3369     0.1364       0.3062     0.1316
         200   0.5    0.1742     0.0256       0.1783     0.0802       0.1892     0.0701
                1.0    0.2899     0.0328       0.2830     0.0824       0.3043     0.0721
         500   0.5    0.0924     0.0065       0.1124     0.0421       0.1004     0.0345
                1.0    0.2299     0.0078       0.2305     0.0458       0.2314     0.0362
        1000   0.5    0.0616     0.0033       0.0637     0.0270       0.0646     0.0224
                1.0    0.1460     0.0034       0.1433     0.0275       0.1429     0.0219

Monte Carlo average squared errors (ASE) based on 100 replications.



that instead of being $[0, 1]$, the range of $Y_{t-\alpha}$,
$\alpha = 1, 2, 3$, needs to be recalibrated. Since we have no exact
knowledge of the distribution of the $Y_t$, we have generated many
realizations of size 50,000 from which we found that more than 95% of the
observations fall in $[-2.58, 2.58]$ ($[-3.14, 3.14]$) with $\sigma_0 = 0.5$
($\sigma_0 = 1$). We will estimate the functions
$\{m_\alpha(x_\alpha)\}_{\alpha=1}^{3}$ for
$x_\alpha \in [-2.58, 2.58]$ ($[-3.14, 3.14]$) with $\sigma_0 = 0.5$
($\sigma_0 = 1.0$), where

(6.2)  $m_1(x_1) \equiv 0,$
       $m_2(x_2) = 1.5 \sin\Big( \frac{\pi}{2} x_2 \Big) - E\Big[ 1.5 \sin\Big( \frac{\pi}{2} Y_t \Big) \Big],$
       $m_3(x_3) = -1.0 \sin\Big( \frac{\pi}{2} x_3 \Big) - E\Big[ -1.0 \sin\Big( \frac{\pi}{2} Y_t \Big) \Big].$
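A data set of the kind used in Example 6.1 can be generated in a few lines. The sketch below assumes the recursion $Y_t = 1.5 \sin(\pi Y_{t-2}/2) - 1.0 \sin(\pi Y_{t-3}/2) + \sigma_0 \varepsilon_t$, starts the series at zero and discards a burn-in of 2000 values; the function name `simulate_example_61`, the zero initialization and the seed handling are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def simulate_example_61(n, sigma0, burn_in=2000, seed=0):
    """Return (Y, X) with Y of length n and X holding lags 1, 2, 3 of Y."""
    rng = np.random.default_rng(seed)
    total = burn_in + n + 3
    y = np.zeros(total)                       # first three values start at 0
    eps = rng.standard_normal(total)
    for t in range(3, total):
        y[t] = (1.5 * np.sin(np.pi / 2 * y[t - 2])
                - 1.0 * np.sin(np.pi / 2 * y[t - 3])
                + sigma0 * eps[t])
    y = y[burn_in:]                           # keep the last n + 3 values
    Y = y[3:]
    X = np.column_stack([y[2:-1], y[1:-2], y[:-3]])   # Y_{t-1}, Y_{t-2}, Y_{t-3}
    return Y, X

Y, X = simulate_example_61(n=200, sigma0=0.5)
# Residual of the recursion: should behave like sigma0 times standard normal noise.
resid = Y - 1.5 * np.sin(np.pi / 2 * X[:, 1]) + 1.0 * np.sin(np.pi / 2 * X[:, 2])
```

Before applying the estimator, the predictors would still need to be rescaled from the recalibrated range to $[0, 1]$, as described above.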



We choose the sample size $n$ to be 100, 200, 500 and 1000. Table 1 lists the
average squared error (ASE) of the SPBK estimators and the constant spline
pilot estimators from 100 Monte Carlo replications. As expected, increases
in sample size reduce ASE for both estimators and across all combinations
of $c$ values and noise levels. Table 1 also shows that our SPBK estimators
improve upon the spline pilot estimators substantially regardless of noise
level and sample size, which implies that our second Nadaraya-Watson
smoothing step is not redundant.

To have some impression of the actual function estimates, at noise level
$\sigma_0 = 0.5$ with sample size $n = 200, 500$, we have plotted the oracle
estimator (thin dotted lines), SPBK estimator $\hat m^*_\alpha$ (thin solid
lines) and their 95% pointwise confidence intervals (upper and lower dashed
curves) for the true functions $m_\alpha$ (thick solid lines) in Figure 1.
The visual impression of the SPBK estimators is rather satisfactory and
their performance improves with increasing $n$.

[Figure 1: six panels, "Estimation of component #1", "#2" and "#3", each for
n = 200 and n = 500.]

Fig. 1. Plots of the oracle estimator (dotted blue curve), SPBK estimator
(solid red curve) and the 95% pointwise confidence intervals constructed by
(2.13) (upper and lower dashed red curves) of the function components
$m_\alpha(x_\alpha)$ in (6.2), $\alpha = 1, 2, 3$ (solid green curve).

[Figure 2: four efficiency density panels, (a)-(d).]

Fig. 2. Kernel density plots of the 100 empirical efficiencies of
$\hat m^*_\alpha(x_\alpha)$ to $\tilde m^*_\alpha(x_\alpha)$, computed
according to (6.1): (a) Example 6.1 ($\alpha = 2$, $d = 3$); (b) Example 6.1
($\alpha = 3$, $d = 3$); (c) Example 6.2 ($\alpha = 1$, $d = 30$);
(d) Example 6.2 ($\alpha = 2$, $d = 30$).
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To see the convergence, Figure 2(a) and (b) plots the kernel density 
estimation of the 100 empirical efficiencies for a = 2, 3 and sample sizes 
n = 100,200,500 and 1000 at the noise level do = 0.5. The vertical line at 
efficiency = 1 is the standard line for the comparison of m* (x a ) and m* (x a ). 
One can clearly see that the center of the density plots is going toward the 
standard line 1.0 with narrower spread when sample size n is increasing, 
which is confirmative to the result of Theorem 2.1. 
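
The density curves in Figure 2 are kernel density estimates of the 100 empirical efficiencies. As a hedged illustration of that plotting step only, the sketch below applies a Gaussian kernel density estimator with Silverman's rule-of-thumb bandwidth to synthetic numbers. The efficiencies themselves must come from (6.1), which is not reproduced here, and the bandwidth rule and the name `gaussian_kde_1d` are assumptions rather than the authors' settings.

```python
import numpy as np

def gaussian_kde_1d(sample, grid):
    """Gaussian kernel density estimate of `sample`, evaluated on `grid`."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    bw = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)  # Silverman's rule of thumb (assumed)
    z = (grid[:, None] - sample[None, :]) / bw
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))

# Synthetic stand-in for 100 empirical efficiencies (NOT values from the paper).
rng = np.random.default_rng(1)
eff = 1.0 + 0.1 * rng.standard_normal(100)

grid = np.linspace(0.0, 2.0, 2001)
dens = gaussian_kde_1d(eff, grid)
# Riemann sum of the estimate; it should be close to 1 when the grid covers the data.
mass = dens.sum() * (grid[1] - grid[0])
print(mass)
```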



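Example 6.2. Consider the nonlinear additive heteroscedastic model

\[
Y_t = \sum_{\alpha=1}^{d} \sin\Bigl(\frac{\pi}{2.5}X_{t-\alpha}\Bigr) + \sigma(\mathbf{X})\varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0,1),
\]

in which $\mathbf{X}_t^T = \{X_{t-1}, \dots, X_{t-d}\}$ is a sequence of i.i.d. standard normal random variables truncated by $[-2.5, 2.5]$ and

\[
\sigma(\mathbf{X}) = \sigma_0\,\frac{\sqrt{d}}{2}\cdot\frac{5 - \exp\bigl(\sum_{\alpha=1}^{d}|X_{t-\alpha}|/d\bigr)}{5 + \exp\bigl(\sum_{\alpha=1}^{d}|X_{t-\alpha}|/d\bigr)}, \qquad \sigma_0 = 0.1.
\]

By this choice of $\sigma(\mathbf{X})$, we ensure that our design is heteroscedastic, and the variance is roughly proportional to the dimension $d$, which is intended to mimic the case when independent copies of the same kind of univariate regression problem are simply added together.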

For d = 30, we have run 100 replications for sample size n = 500, 1000, 1500 
and 2000. The kernel density estimation of the 100 empirical efficiencies for 
a = 1,2 is graphically represented respectively in (c) and (d) of Figure 2. 
Again one sees that with increasing n, the efficiency distribution converges 
to 1. 
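
To make the design of Example 6.2 concrete, the following sketch draws the truncated normal predictors, builds the lagged design and evaluates the heteroscedastic scale. It assumes the form $\sigma(\mathbf{X}) = \sigma_0(\sqrt{d}/2)\{5 - \exp(\sum_\alpha |X_{t-\alpha}|/d)\}/\{5 + \exp(\sum_\alpha |X_{t-\alpha}|/d)\}$ with $\sigma_0 = 0.1$; the helper names and the rejection sampler are illustrative choices, not the authors' code.

```python
import numpy as np

def truncated_normal(size, bound, rng):
    """Standard normal draws truncated to [-bound, bound], by rejection."""
    out = np.empty(0)
    while out.size < size:
        z = rng.standard_normal(2 * size)
        out = np.concatenate([out, z[np.abs(z) <= bound]])
    return out[:size]

def lag_matrix(x, d):
    """Row for time t holds (x[t-1], ..., x[t-d]), for t = d, ..., len(x) - 1."""
    n = x.size - d
    return np.column_stack([x[d - a: d - a + n] for a in range(1, d + 1)])

def sigma(X, sigma0=0.1):
    """ASSUMED heteroscedastic scale of Example 6.2."""
    d = X.shape[1]
    s = np.abs(X).sum(axis=1) / d
    return sigma0 * np.sqrt(d) / 2 * (5 - np.exp(s)) / (5 + np.exp(s))

rng = np.random.default_rng(2)
d, n = 30, 500
x = truncated_normal(n + d, 2.5, rng)
X = lag_matrix(x, d)
sig = sigma(X)
print(X.shape, float(sig.max()))
```

Since $\sum_\alpha |X_{t-\alpha}|/d \ge 0$, the ratio in the scale function is at most $4/6$, so the scale never exceeds $\sigma_0\sqrt{d}/3$ under the assumed form.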

Lastly, we provide the computing time of Example 6.1 from 100 replications on an ordinary PC with an Intel Pentium IV 1.86 GHz processor and 1.0 GB RAM. The average time taken by XploRe to generate one sample of size $n$ and compute the SPBK estimator and the marginal integration estimator (MIE) is reported in Table 2. The MIEs have been obtained by directly calling "intest" in XploRe. As expected, the computing time for MIE is extremely sensitive to sample size because it requires $n^2$ least squares in two steps. In contrast, at least for large samples, our proposed SPBK is thousands of times faster than MIE. Thus our SPBK estimation is feasible and appealing for dealing with massive data sets.
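
The speed comparison can be read off Table 2 directly. The short computation below merely divides the reported MIE times by the reported SPBK times; it adds no new measurements, and the variable names are illustrative.

```python
# Average computing times in seconds, copied from Table 2.
n_values = [100, 200, 400, 1000]
mie = [10, 76, 628, 10728]
spbk = [0.7, 0.9, 1.2, 4.5]

# Ratio of MIE time to SPBK time at each sample size.
speedup = [m / s for m, s in zip(mie, spbk)]
for n, r in zip(n_values, speedup):
    print(f"n = {n:5d}: MIE/SPBK = {r:8.1f}")
```

At $n = 1000$ the ratio is $10728/4.5 = 2384$, which is the sense in which SPBK is thousands of times faster for large samples.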



APPENDIX 

Throughout this section, $a_n \gg b_n$ means $\lim_{n\to\infty} b_n/a_n = 0$, and $a_n \sim b_n$ means $\lim_{n\to\infty} b_n/a_n = c$, where $c$ is some constant.

Table 2
The computing time of Example 6.1 (in seconds)

Method    n = 100    n = 200    n = 400    n = 1000
MIE       10         76         628        10728
SPBK      0.7        0.9        1.2        4.5

A.1. Preliminaries. We first give the Bernstein inequality for a geometrically $\alpha$-mixing sequence, which plays an important role throughout our proofs.

Lemma A.1 (Theorem 1.4, page 31 of [1]). Let $\{\xi_t, t \in \mathbb{Z}\}$ be a zero-mean real-valued $\alpha$-mixing process, $S_n = \sum_{i=1}^{n}\xi_i$. Suppose that there exists $c > 0$ such that for $i = 1,\dots,n$, $k = 3,4,\dots$, $E|\xi_i|^k \le c^{k-2}k!\,E\xi_i^2 < +\infty$; then for each $n > 1$, integer $q \in [1, n/2]$, each $\varepsilon > 0$ and $k \ge 3$,

\[
P(|S_n| > n\varepsilon) \le a_1\exp\Bigl(-\frac{q\varepsilon^2}{25m_2^2 + 5c\varepsilon}\Bigr) + a_2(k)\,\alpha\Bigl(\Bigl[\frac{n}{q+1}\Bigr]\Bigr)^{2k/(2k+1)},
\]

where $\alpha(\cdot)$ is the $\alpha$-mixing coefficient defined in (2.10) and

\[
a_1 = 2\frac{n}{q} + 2\Bigl(1 + \frac{\varepsilon^2}{25m_2^2 + 5c\varepsilon}\Bigr), \qquad a_2(k) = 11n\Bigl(1 + \frac{5m_k^{2k/(2k+1)}}{\varepsilon}\Bigr),
\]

with $m_r = \max_{1\le i\le n}\|\xi_i\|_r$, $r \ge 2$.
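
To see how the pieces of the Bernstein bound interact, the following sketch evaluates its right-hand side, $a_1\exp\{-q\varepsilon^2/(25m_2^2 + 5c\varepsilon)\} + a_2(k)\,\alpha([n/(q+1)])^{2k/(2k+1)}$, with $a_1 = 2n/q + 2\{1 + \varepsilon^2/(25m_2^2 + 5c\varepsilon)\}$ and $a_2(k) = 11n(1 + 5m_k^{2k/(2k+1)}/\varepsilon)$. The function name, the argument names and the numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def bernstein_mixing_bound(n, q, eps, c, m2, mk, k, alpha):
    """Right-hand side of the Bernstein-type bound for alpha-mixing sequences.

    alpha: callable returning the mixing coefficient at an integer lag.
    Returns the tuple (bound, a1, a2).
    """
    denom = 25 * m2**2 + 5 * c * eps
    a1 = 2 * n / q + 2 * (1 + eps**2 / denom)
    a2 = 11 * n * (1 + 5 * mk ** (2 * k / (2 * k + 1)) / eps)
    lag = n // (q + 1)  # integer part [n / (q + 1)]
    bound = (a1 * math.exp(-q * eps**2 / denom)
             + a2 * alpha(lag) ** (2 * k / (2 * k + 1)))
    return bound, a1, a2

# Illustrative inputs only: geometric mixing coefficient alpha(j) = exp(-j).
bound, a1, a2 = bernstein_mixing_bound(
    n=100, q=10, eps=1.0, c=1.0, m2=1.0, mk=1.0, k=3,
    alpha=lambda j: math.exp(-j),
)
print(a1, a2, bound)
```

The block size $q$ trades the two terms against each other: a larger $q$ shrinks the exponential term but shortens the lag $[n/(q+1)]$ at which the mixing coefficient is evaluated.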

Lemma A.2. Under assumptions (A4) and (A6), one has:

(i) There exist constants $C_0(f)$ and $C_1(f)$ depending on the marginal densities $f_\alpha(x_\alpha)$, $\alpha = 1,2$, such that $C_0(f)H \le \|b_{J,\alpha}\|_2^2 \le C_1(f)H$, where $b_{J,\alpha}$ is given in (2.2).

(ii) For any $\alpha = 1,2$, $|J' - J| \le 1$, $E\{B_{J,\alpha}(X_{i\alpha})B_{J',\alpha}(X_{i\alpha})\} \sim 1$; in addition

\[
E|B_{J,\alpha}(X_{i\alpha})B_{J',\alpha}(X_{i\alpha})|^k \sim H^{1-k}, \qquad k \ge 1,
\]

where $B_{J,\alpha}$ and $B_{J',\alpha}$ are defined in (2.3).

For the proof of the above lemma, see Lemma A.2 in [26].
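
Lemma A.2 can be checked by hand in the simplest case. The sketch below assumes a uniform design density on $[0,1]$ and $N+1$ equal-length intervals of width $H = 1/(N+1)$, so that the centered basis reduces to $b_J = I_{J+1} - I_J$ with $\|b_J\|_2^2 = 2H$ and $B_J = b_J/\sqrt{2H}$. These simplifications and the function name are assumptions made for illustration, not the general setting of (2.1)-(2.3).

```python
import numpy as np

def constant_spline_basis(x, N):
    """Standardized centered constant-spline basis B_1, ..., B_N at the points x,
    ASSUMING a uniform density on [0, 1] and N + 1 equal intervals."""
    H = 1.0 / (N + 1)
    idx = np.minimum((x / H).astype(int), N)    # interval index 0, ..., N
    indicators = np.eye(N + 1)[idx]             # one-hot rows: I_1(x), ..., I_{N+1}(x)
    b = indicators[:, 1:] - indicators[:, :-1]  # b_J = I_{J+1} - I_J
    return b / np.sqrt(2 * H)

N = 9
H = 1.0 / (N + 1)
M = 1000 * (N + 1)
x = (np.arange(M) + 0.5) / M  # midpoint grid: averages act as integrals against the uniform density
B = constant_spline_basis(x, N)

second_moment = np.mean(B[:, 0] ** 2)          # E B_1^2
cross_moment = np.mean(B[:, 0] * B[:, 1])      # E B_1 B_2, neighbouring indices
k = 2
kth_moment = np.mean(np.abs(B[:, 0] * B[:, 0]) ** k)  # E |B_1 B_1|^k
print(second_moment, cross_moment, kth_moment * (2 * H) ** (k - 1))
```

In this uniform case $E|B_JB_J|^k = (2H)^{1-k}$ exactly, which matches the order $H^{1-k}$ in part (ii), and the neighbouring cross moment equals $-1/2$, which is of order one in absolute value.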

Lemma A.3. Under assumptions (A4)-(A6), for $\mu_{\omega_J}(x_1)$ given in (5.5),

\[
\sup_{x_1\in[0,1]}\ \sup_{1\le J\le N} |\mu_{\omega_J}(x_1)| = O(H^{1/2}).
\]

Proof. Denote the theoretical norm of $I_{J,\alpha}$ in (2.1) for $\alpha = 1,2$, $J = 1,\dots,N+1$,

\[
c_{J,\alpha} = \|I_{J,\alpha}\|_2^2 = \int I_{J,\alpha}^2(x_\alpha)f_\alpha(x_\alpha)\,dx_\alpha. \tag{A.1}
\]

By definition, $|\mu_{\omega_J}(x_1)| = |E\{K_h(X_{l1} - x_1)B_{J,2}(X_{l2})\}|$ is bounded by

\[
\begin{aligned}
&\iint K_h(u_1 - x_1)|B_{J,2}(u_2)|f(u_1,u_2)\,du_1\,du_2\\
&\qquad = \|b_{J,2}\|_2^{-1}\Bigl\{\iint K(v_1)I_{J+1,2}(u_2)f(hv_1 + x_1,u_2)\,dv_1\,du_2\\
&\qquad\qquad + \Bigl(\frac{c_{J+1,2}}{c_{J,2}}\Bigr)^{1/2}\iint K(v_1)I_{J,2}(u_2)f(hv_1 + x_1,u_2)\,dv_1\,du_2\Bigr\}.
\end{aligned}
\]

The boundedness of the joint density $f$ and the Lipschitz continuity of the kernel $K$ will then imply that

\[
\sup_{x_1\in[0,1]}\ \sup_{1\le J\le N}\iint K(v_1)I_{J,2}(u_2)f(hv_1 + x_1,u_2)\,dv_1\,du_2 \le C_KC_fH.
\]

The proof of the lemma is then completed by (i) of Lemma A.2. □
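
Lemma A.4. Under assumptions (A2) and (A4)-(A6), one has

\[
\sup_{x_1\in[0,1]}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\{\omega_J(\mathbf{X}_l,x_1) - \mu_{\omega_J}(x_1)\}\Bigr| = O_p(\log n/\sqrt{nh}), \tag{A.2}
\]

\[
\sup_{x_1\in[0,1]}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\omega_J(\mathbf{X}_l,x_1)\Bigr| = O_p(H^{1/2}), \tag{A.3}
\]

where $\omega_J(\mathbf{X}_l,x_1)$ and $\mu_{\omega_J}(x_1)$ are given in (5.5).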



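Proof. For simplicity, denote $\omega^*_J(\mathbf{X}_l,x_1) = \omega_J(\mathbf{X}_l,x_1) - \mu_{\omega_J}(x_1)$. Then

\[
E\{\omega^*_J(\mathbf{X}_l,x_1)\}^2 = E\omega_J^2(\mathbf{X}_l,x_1) - \mu^2_{\omega_J}(x_1),
\]

while $E\omega_J^2(\mathbf{X}_l,x_1)$ is equal to

\[
h^{-1}\|b_{J,2}\|_2^{-2}\iint K^2(v_1)\Bigl\{I_{J+1,2}(u_2) + \frac{c_{J+1,2}}{c_{J,2}}I_{J,2}(u_2)\Bigr\}f(hv_1 + x_1,u_2)\,dv_1\,du_2,
\]

where $c_{J,\alpha}$ is given in (A.1). So $E\omega_J^2(\mathbf{X}_l,x_1) \sim h^{-1}$ and $E\omega_J^2(\mathbf{X}_l,x_1) \gg \mu^2_{\omega_J}(x_1)$. Hence for $n$ sufficiently large, $E\{\omega^*_J(\mathbf{X}_l,x_1)\}^2 = E\omega_J^2(\mathbf{X}_l,x_1) - \mu^2_{\omega_J}(x_1) \ge c^*h^{-1}$ for some positive constant $c^*$. When $r \ge 3$, the $r$th moment $E|\omega_J(\mathbf{X}_l,x_1)|^r$ is

\[
\|b_{J,2}\|_2^{-r}\iint K_h^r(u_1 - x_1)\Bigl\{I_{J+1,2}(u_2) + \Bigl(\frac{c_{J+1,2}}{c_{J,2}}\Bigr)^{r/2}I_{J,2}(u_2)\Bigr\}f(u_1,u_2)\,du_1\,du_2.
\]

It is clear that $E|\omega_J(\mathbf{X}_l,x_1)|^r \sim h^{(1-r)}H^{1-r/2}$. According to Lemma A.3, one has $|E\omega_J(\mathbf{X}_l,x_1)|^r \le CH^{r/2}$, thus $E|\omega_J(\mathbf{X}_l,x_1)|^r \gg |\mu_{\omega_J}(x_1)|^r$. In addition

\[
E|\omega^*_J(\mathbf{X}_l,x_1)|^r \le \Bigl(\frac{c}{hH^{1/2}}\Bigr)^{(r-2)}r!\,E|\omega^*_J(\mathbf{X}_l,x_1)|^2,
\]

so there exists $c^* = ch^{-1}H^{-1/2}$ such that $E|\omega^*_J(\mathbf{X}_l,x_1)|^r \le c_*^{r-2}r!\,E|\omega^*_J(\mathbf{X}_l,x_1)|^2$ with $c_* = c^*$, which implies that $\{\omega^*_J(\mathbf{X}_l,x_1)\}_{l=1}^{n}$ satisfies Cramér's condition. By Bernstein's inequality, for $r = 3$

\[
P\Bigl\{\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_1)\Bigr| \ge \rho_n\Bigr\} \le a_1\exp\Bigl(-\frac{q\rho_n^2}{25m_2^2 + 5c^*\rho_n}\Bigr) + a_2(3)\,\alpha\Bigl(\Bigl[\frac{n}{q+1}\Bigr]\Bigr)^{6/7}
\]

with $m_2^2 \sim h^{-1}$, $m_3 = \max_{1\le l\le n}\|\omega^*_J(\mathbf{X}_l,x_1)\|_3 \le \{C_0(2h^{-1})^2\}^{1/3}$ and

\[
\rho_n = \rho\frac{\log n}{\sqrt{nh}}, \qquad a_1 = 2\frac{n}{q} + 2\Bigl(1 + \frac{\rho_n^2}{25m_2^2 + 5c^*\rho_n}\Bigr), \qquad a_2(3) = 11n\Bigl(1 + \frac{5m_3^{6/7}}{\rho_n}\Bigr).
\]

Observe that $5c^*\rho_n = o(1)$; by taking $q$ such that $[n/(q+1)] \ge c_0\log n$, $q \ge c_1n/\log n$ for some constants $c_0, c_1$, one has $a_1 = O(n/q) = O(\log n)$, $a_2(3) = o(n^2)$. Assumption (A2) yields that $\alpha([n/(q+1)])^{6/7} \le Cn^{-6\lambda_0c_0/7}$. Thus, for $n$ large enough,

\[
P\Bigl\{\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_1)\Bigr| > \frac{\rho\log n}{\sqrt{nh}}\Bigr\} \le cn^{-c_2\rho^2}\log n + Cn^{2-6\lambda_0c_0/7}. \tag{A.4}
\]

We divide the interval $[0,1]$ into $M_n \sim n^6$ equally spaced intervals with disjoint endpoints $0 = x_{1,0} < x_{1,1} < \cdots < x_{1,M_n} = 1$. Employing the discretization method,

\[
\begin{aligned}
\sup_{x_1\in[0,1]}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_1)\Bigr|
&\le \sup_{0\le k\le M_n}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_{1,k})\Bigr|\\
&\quad + \sup_{1\le k\le M_n}\ \sup_{1\le J\le N}\ \sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Bigl|n^{-1}\sum_{l=1}^{n}\{\omega^*_J(\mathbf{X}_l,x_1) - \omega^*_J(\mathbf{X}_l,x_{1,k})\}\Bigr|.
\end{aligned}
\tag{A.5}
\]

By (A.4), there exists a large enough value $\rho > 0$ such that for any $1 \le k \le M_n$, $1 \le J \le N$,

\[
P\Bigl\{\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_{1,k})\Bigr| > \rho(nh)^{-1/2}\log n\Bigr\} \le n^{-10},
\]

which implies that

\[
\begin{aligned}
\sum_{n=1}^{\infty}P\Bigl\{\sup_{0\le k\le M_n}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_{1,k})\Bigr| \ge \rho\frac{\log n}{\sqrt{nh}}\Bigr\}
&\le \sum_{n=1}^{\infty}\sum_{k=1}^{M_n}\sum_{J=1}^{N}P\Bigl\{\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_{1,k})\Bigr| \ge \rho\frac{\log n}{\sqrt{nh}}\Bigr\}\\
&\le \sum_{n=1}^{\infty}NM_nn^{-10} < \infty.
\end{aligned}
\]

Thus, the Borel-Cantelli lemma entails that

\[
\sup_{0\le k\le M_n}\ \sup_{1\le J\le N}\Bigl|n^{-1}\sum_{l=1}^{n}\omega^*_J(\mathbf{X}_l,x_{1,k})\Bigr| = O_p(\log n/\sqrt{nh}). \tag{A.6}
\]

Employing Lipschitz continuity of the kernel $K$, for $x_1 \in [x_{1,k-1},x_{1,k}]$,

\[
\sup_{1\le k\le M_n}|K_h(X_{l1} - x_1) - K_h(X_{l1} - x_{1,k})| \le C_KM_n^{-1}h^{-2}.
\]

According to the fact that $M_n \sim n^6$, one has

\[
\begin{aligned}
\sup_{1\le k\le M_n}\ \sup_{1\le J\le N}\ \sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Bigl|n^{-1}\sum_{l=1}^{n}\{\omega^*_J(\mathbf{X}_l,x_1) - \omega^*_J(\mathbf{X}_l,x_{1,k})\}\Bigr|
&\le C_KM_n^{-1}h^{-2}\sup_{x_2\in[0,1]}\ \sup_{1\le J\le N}|B_{J,2}(x_2)|\\
&= O(M_n^{-1}h^{-2}H^{-1/2}) = o(n^{-1}).
\end{aligned}
\]

Thus (A.2) follows instantly from (A.5) and (A.6). As a result of Lemma A.3 and (A.2), (A.3) holds. □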



Lemma A.5. Under assumptions (A4) and (A6), there exist constants $C_0 > c_0 > 0$ such that for any $\mathbf{a} = (a_0, a_{1,1}, \dots, a_{N,1}, a_{1,2}, \dots, a_{N,2})$,

\[
c_0\Bigl(a_0^2 + \sum_{J,\alpha}a_{J,\alpha}^2\Bigr) \le \Bigl\|a_0 + \sum_{J,\alpha}a_{J,\alpha}B_{J,\alpha}\Bigr\|_2^2 \le C_0\Bigl(a_0^2 + \sum_{J,\alpha}a_{J,\alpha}^2\Bigr). \tag{A.7}
\]

For the proof of the above lemma, see Lemma A.4 in [26]. The next lemma provides the size of $\hat{\mathbf{a}}^T\hat{\mathbf{a}}$, where $\hat{\mathbf{a}}$ is the least squares solution defined by (3.6).

Lemma A.6. Under assumptions (A2)-(A6), $\hat{\mathbf{a}}$ satisfies

\[
\hat{\mathbf{a}}^T\hat{\mathbf{a}} = \hat a_0^2 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}\hat a_{J,\alpha}^2 = O_p\{N(\log n)^2/n\}. \tag{A.8}
\]

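Proof. According to (3.7) and (3.8), $\hat{\mathbf{a}}^T\mathbf{B}^T\mathbf{B}\hat{\mathbf{a}} = \hat{\mathbf{a}}^T(\mathbf{B}^T\mathbf{E})$. Thus

\[
\|\mathbf{B}\hat{\mathbf{a}}\|_{2,n}^2 = \hat{\mathbf{a}}^T(n^{-1}\mathbf{B}^T\mathbf{B})\hat{\mathbf{a}} = \hat{\mathbf{a}}^T(n^{-1}\mathbf{B}^T\mathbf{E}). \tag{A.9}
\]

By (A.15), $\|\mathbf{B}\hat{\mathbf{a}}\|_{2,n}^2$ is bounded below in probability by $(1 - A_n)\|\mathbf{B}\hat{\mathbf{a}}\|_2^2$. According to (A.7), one has

\[
\|\mathbf{B}\hat{\mathbf{a}}\|_2^2 \ge c_0\Bigl(\hat a_0^2 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}\hat a_{J,\alpha}^2\Bigr). \tag{A.10}
\]

Meanwhile one can show that $\hat{\mathbf{a}}^T(n^{-1}\mathbf{B}^T\mathbf{E})$ is bounded above by

\[
\Bigl(\hat a_0^2 + \sum_{J,\alpha}\hat a_{J,\alpha}^2\Bigr)^{1/2}\Bigl[\Bigl\{\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Bigr\}^2 + \sum_{J,\alpha}\Bigl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Bigr\}^2\Bigr]^{1/2}. \tag{A.11}
\]

Combining (A.9), (A.10) and (A.11), the squared norm $\hat{\mathbf{a}}^T\hat{\mathbf{a}}$ is bounded by

\[
c_0^{-2}(1 - A_n)^{-2}\Bigl[\Bigl\{\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Bigr\}^2 + \sum_{J,\alpha}\Bigl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Bigr\}^2\Bigr].
\]

Using the same truncation of $\varepsilon$ as in Lemma A.11, the Bernstein inequality entails that

\[
\Bigl|\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Bigr| + \max_{1\le J\le N,\,\alpha=1,2}\Bigl|\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Bigr| = O_p(\log n/\sqrt{n}).
\]

Therefore (A.8) holds since $A_n$ is of order $o_p(1)$. □

A.2. Empirical approximation of the theoretical inner product.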

Lemma A.7. Under assumptions (A2), (A4) and (A6), one has

\[
\sup_{J,\alpha}|\langle 1,B_{J,\alpha}\rangle_{2,n} - \langle 1,B_{J,\alpha}\rangle_2| = O_p(n^{-1/2}\log n), \tag{A.12}
\]

\[
\sup_{J,J',\alpha}|\langle B_{J,\alpha},B_{J',\alpha}\rangle_{2,n} - \langle B_{J,\alpha},B_{J',\alpha}\rangle_2| = O_p(n^{-1/2}H^{-1/2}\log n), \tag{A.13}
\]

\[
\sup_{1\le J,J'\le N,\ \alpha\ne\alpha'}|\langle B_{J,\alpha},B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha},B_{J',\alpha'}\rangle_2| = O_p(n^{-1/2}\log n). \tag{A.14}
\]

For the proof of the above lemma, see Lemma A.7 in [26].
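
The flavor of (A.13) can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions, not a verification of the lemma: the predictor is drawn i.i.d. uniform on $[0,1]$ instead of coming from a mixing time series, the knots are equally spaced so that $B_J = (I_{J+1} - I_J)/\sqrt{2H}$, and all function names are hypothetical.

```python
import numpy as np

def constant_spline_basis(x, N):
    """Standardized centered constant-spline basis, ASSUMING a uniform density
    on [0, 1] and N + 1 equal intervals of width H = 1 / (N + 1)."""
    H = 1.0 / (N + 1)
    idx = np.minimum((x / H).astype(int), N)
    indicators = np.eye(N + 1)[idx]
    return (indicators[:, 1:] - indicators[:, :-1]) / np.sqrt(2 * H)

def gram_deviation(n, N, seed=0):
    """Largest entrywise gap between the empirical and theoretical inner
    products of the basis functions of one predictor."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    B = constant_spline_basis(x, N)
    empirical = B.T @ B / n
    # Under uniformity: 1 on the diagonal, -1/2 for |J - J'| = 1, 0 otherwise.
    theoretical = np.eye(N) - 0.5 * (np.eye(N, k=1) + np.eye(N, k=-1))
    return float(np.max(np.abs(empirical - theoretical)))

for n in (1_000, 100_000):
    print(n, gram_deviation(n, N=9))
```

The printed gap should shrink as $n$ grows, in line with the $n^{-1/2}$ factor in the stated rate.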

Lemma A.8. Under assumptions (A2), (A4) and (A6), one has

\[
A_n = \sup_{g_1,g_2\in G^{(-1)}}\frac{|\langle g_1,g_2\rangle_{2,n} - \langle g_1,g_2\rangle_2|}{\|g_1\|_2\|g_2\|_2} = O_p\Bigl(\frac{\log n}{n^{1/2}H^{1/2}}\Bigr) = o_p(1). \tag{A.15}
\]



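Proof. For every $g_1, g_2 \in G^{(-1)}$, one can write

\[
g_1(X_1,X_2) = a_0 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}a_{J,\alpha}B_{J,\alpha}(X_\alpha),
\]

\[
g_2(X_1,X_2) = a'_0 + \sum_{J'=1}^{N}\sum_{\alpha'=1}^{2}a'_{J',\alpha'}B_{J',\alpha'}(X_{\alpha'}),
\]

where for any $J, J' = 1,\dots,N$, $\alpha, \alpha' = 1,2$, $a_{J,\alpha}$ and $a'_{J',\alpha'}$ are real constants. Then

\[
\begin{aligned}
|\langle g_1,g_2\rangle_{2,n} - \langle g_1,g_2\rangle_2|
&\le \Bigl|\sum_{J,\alpha}\langle a'_0,a_{J,\alpha}B_{J,\alpha}\rangle_{2,n}\Bigr| + \Bigl|\sum_{J',\alpha'}\langle a_0,a'_{J',\alpha'}B_{J',\alpha'}\rangle_{2,n}\Bigr|\\
&\quad + \sum_{J,J',\alpha,\alpha'}|a_{J,\alpha}||a'_{J',\alpha'}|\,|\langle B_{J,\alpha},B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha},B_{J',\alpha'}\rangle_2|\\
&= L_1 + L_2 + L_3.
\end{aligned}
\]

The equivalence of norms given in (A.7) and (A.12) leads to

\[
\begin{aligned}
L_1 &\le A_{n,1}\cdot|a'_0|\cdot\sum_{J,\alpha}|a_{J,\alpha}|\\
&\le CA_{n,1}\Bigl(a_0'^2 + \sum_{J',\alpha'}a_{J',\alpha'}'^2\Bigr)^{1/2}\Bigl(\sum_{J,\alpha}a_{J,\alpha}^2\Bigr)^{1/2}N^{1/2}\\
&\le C_{A,1}A_{n,1}\|g_1\|_2\|g_2\|_2H^{-1/2}\\
&= O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2.
\end{aligned}
\]

Similarly, $L_2 = O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2$. By the Cauchy-Schwarz inequality

\[
\begin{aligned}
L_3 &\le \sum_{J,J',\alpha,\alpha'}|a_{J,\alpha}||a'_{J',\alpha'}|\max(A_{n,2},A_{n,3})\\
&\le C_{A,2}\max(A_{n,2},A_{n,3})\|g_1\|_2\|g_2\|_2\\
&= O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2.
\end{aligned}
\]

Therefore, statement (A.15) is established. □

A.3. Proof of Lemma 5.2. Denote $\mathbf{V}$ as the theoretical inner product of the B spline basis $\{1, B_{J,\alpha}(x_\alpha), J = 1,\dots,N, \alpha = 1,2\}$, that is,

\[
\mathbf{V} = \begin{pmatrix} 1 & \mathbf{0}_{2N}^T\\ \mathbf{0}_{2N} & \bigl(\langle B_{J,\alpha},B_{J',\alpha'}\rangle_2\bigr)_{1\le\alpha,\alpha'\le 2,\ 1\le J,J'\le N}\end{pmatrix}
= \begin{pmatrix} 1 & \mathbf{0}_N^T & \mathbf{0}_N^T\\ \mathbf{0}_N & \mathbf{V}_{11} & \mathbf{V}_{12}\\ \mathbf{0}_N & \mathbf{V}_{21} & \mathbf{V}_{22}\end{pmatrix}, \tag{A.16}
\]

where $\mathbf{0}_p = \{0,\dots,0\}^T$ is the zero vector of length $p$. Let $\mathbf{S}$ be the inverse matrix of $\mathbf{V}$, that is,

\[
\mathbf{S} = \mathbf{V}^{-1} = \begin{pmatrix} 1 & \mathbf{0}_N^T & \mathbf{0}_N^T\\ \mathbf{0}_N & \mathbf{S}_{11} & \mathbf{S}_{12}\\ \mathbf{0}_N & \mathbf{S}_{21} & \mathbf{S}_{22}\end{pmatrix}. \tag{A.17}
\]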

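Lemma A.9. Under assumptions (A4) and (A6), for $\mathbf{V}$, $\mathbf{S}$ defined in (A.16), (A.17), there exist constants $C_V > c_V > 0$ and $C_S > c_S > 0$ such that

\[
c_V\mathbf{I}_{2N+1} \le \mathbf{V} \le C_V\mathbf{I}_{2N+1}, \qquad c_S\mathbf{I}_{2N+1} \le \mathbf{S} \le C_S\mathbf{I}_{2N+1}.
\]

For the proof of the above lemma, see Lemma A.9 in [26]. Next we denote

\[
\mathbf{V}^* = \begin{pmatrix} 0 & \mathbf{0}_{2N}^T\\ \mathbf{0}_{2N} & \bigl(\langle B_{J,\alpha},B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha},B_{J',\alpha'}\rangle_2\bigr)_{1\le\alpha,\alpha'\le 2,\ 1\le J,J'\le N}\end{pmatrix}.
\]

Then $\hat{\mathbf{a}}$ in (3.8) can be rewritten as

\[
\hat{\mathbf{a}} = (\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{E} = \Bigl(\frac{1}{n}\mathbf{B}^T\mathbf{B}\Bigr)^{-1}\Bigl(\frac{1}{n}\mathbf{B}^T\mathbf{E}\Bigr) = (\mathbf{V} + \mathbf{V}^*)^{-1}\Bigl(\frac{1}{n}\mathbf{B}^T\mathbf{E}\Bigr). \tag{A.18}
\]

Now define $\tilde{\mathbf{a}} = \{\tilde a_0, \tilde a_{1,1}, \dots, \tilde a_{N,1}, \tilde a_{1,2}, \dots, \tilde a_{N,2}\}^T$ as

\[
\tilde{\mathbf{a}} = \mathbf{V}^{-1}(n^{-1}\mathbf{B}^T\mathbf{E}) = \mathbf{S}(n^{-1}\mathbf{B}^T\mathbf{E}), \tag{A.19}
\]

and define a theoretical version of $\hat\Psi_v^{(2)}$ in (5.6) as

\[
\tilde\Psi_v^{(2)}(x_1) = n^{-1}\sum_{l=1}^{n}\sum_{J=1}^{N}\tilde a_{J,2}\,\omega_J(\mathbf{X}_l,x_1). \tag{A.20}
\]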

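Lemma A.10. Under assumptions (A2) to (A6),

\[
\sup_{x_1\in[0,1]}|\hat\Psi_v^{(2)}(x_1) - \tilde\Psi_v^{(2)}(x_1)| = O_p\{(\log n)^2/(nH)\}.
\]

Proof. According to (A.18) and (A.19), one has $\mathbf{V}\tilde{\mathbf{a}} = (\mathbf{V} + \mathbf{V}^*)\hat{\mathbf{a}}$, which implies that $\mathbf{V}^*\hat{\mathbf{a}} = \mathbf{V}(\tilde{\mathbf{a}} - \hat{\mathbf{a}})$. Using (A.13) and (A.14), one obtains that

\[
\|\mathbf{V}(\tilde{\mathbf{a}} - \hat{\mathbf{a}})\| = \|\mathbf{V}^*\hat{\mathbf{a}}\| \le O_p(n^{-1/2}H^{-1}\log n)\|\hat{\mathbf{a}}\|.
\]

According to Lemma A.6, $\|\hat{\mathbf{a}}\| = O_p(n^{-1/2}N^{1/2}\log n)$, so one has

\[
\|\mathbf{V}(\tilde{\mathbf{a}} - \hat{\mathbf{a}})\| \le O_p\{(\log n)^2n^{-1}N^{3/2}\}.
\]

By Lemma A.9, $\|\tilde{\mathbf{a}} - \hat{\mathbf{a}}\| = O_p\{(\log n)^2n^{-1}N^{3/2}\}$. Lemma A.6 then implies

\[
\|\tilde{\mathbf{a}}\| \le \|\tilde{\mathbf{a}} - \hat{\mathbf{a}}\| + \|\hat{\mathbf{a}}\| = O_p(\log n\sqrt{N/n}). \tag{A.21}
\]

Additionally, $|\hat\Psi_v^{(2)}(x_1) - \tilde\Psi_v^{(2)}(x_1)| = |\sum_{J=1}^{N}(\hat a_{J,2} - \tilde a_{J,2})\,n^{-1}\sum_{l=1}^{n}\omega_J(\mathbf{X}_l,x_1)|$.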
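So

\[
\sup_{x_1\in[0,1]}|\hat\Psi_v^{(2)}(x_1) - \tilde\Psi_v^{(2)}(x_1)| \le \sqrt{N}\,O_p\Bigl\{\frac{(\log n)^2N^{3/2}}{n}\Bigr\}O_p(H^{1/2}) = O_p\Bigl\{\frac{(\log n)^2}{nH}\Bigr\}.
\]

Therefore the lemma follows. □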
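
The algebraic step at the start of this proof, $\mathbf{V}^*\hat{\mathbf{a}} = \mathbf{V}(\tilde{\mathbf{a}} - \hat{\mathbf{a}})$, holds for any invertible $\mathbf{V}$ and $\mathbf{V} + \mathbf{V}^*$ and any right-hand side. The following numerical sketch checks it on random matrices; the matrices and the vector `r` (standing in for $n^{-1}\mathbf{B}^T\mathbf{E}$) are synthetic placeholders, not quantities from the simulations.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 7

# A well-conditioned symmetric positive definite stand-in for V.
A = rng.standard_normal((p, p))
V = A @ A.T / p + np.eye(p)

# A small symmetric perturbation standing in for V*.
P = 0.01 * rng.standard_normal((p, p))
V_star = (P + P.T) / 2

r = rng.standard_normal(p)  # stands in for n^{-1} B^T E

a_hat = np.linalg.solve(V + V_star, r)  # analogue of (A.18)
a_tilde = np.linalg.solve(V, r)         # analogue of (A.19)

lhs = V_star @ a_hat
rhs = V @ (a_tilde - a_hat)
print(np.max(np.abs(lhs - rhs)))
```

Because $\mathbf{V}\tilde{\mathbf{a}}$ and $(\mathbf{V} + \mathbf{V}^*)\hat{\mathbf{a}}$ both equal the same right-hand side, the printed discrepancy is at the level of floating-point rounding.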



Lemma A.11. Under assumptions (A2)-(A6), for $\tilde\Psi_v^{(2)}(x_1)$ as defined in (A.20), one has

\[
\sup_{x_1\in[0,1]}|\tilde\Psi_v^{(2)}(x_1)| = \sup_{x_1\in[0,1]}\Bigl|n^{-1}\sum_{i=1}^{n}K_h(X_{i1} - x_1)\sum_{J=1}^{N}\tilde a_{J,2}B_{J,2}(X_{i2})\Bigr| = O_p(H).
\]

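Proof. Note that

\[
|\tilde\Psi_v^{(2)}(x_1)| \le \Bigl|\sum_{J=1}^{N}\tilde a_{J,2}\,\mu_{\omega_J}(x_1)\Bigr| + \Bigl|\sum_{J=1}^{N}\tilde a_{J,2}\,n^{-1}\sum_{i=1}^{n}\{\omega_J(\mathbf{X}_i,x_1) - \mu_{\omega_J}(x_1)\}\Bigr| = Q_1(x_1) + Q_2(x_1). \tag{A.22}
\]

By the Cauchy-Schwarz inequality, (A.21), Lemma A.4 and assumptions (A5), (A6),

\[
\sup_{x_1\in[0,1]}Q_2(x_1) = O_p\Bigl(\log n\sqrt{\frac{N}{n}}\Bigr)\sqrt{N}\,O_p\Bigl(\frac{\log n}{\sqrt{nh}}\Bigr) = o_p(H). \tag{A.23}
\]

Using the discretization idea again as in the proof of Lemma A.4, one has

\[
\begin{aligned}
\sup_{x_1\in[0,1]}Q_1(x_1)
&\le \max_{1\le k\le M_n}\Bigl|\sum_{J=1}^{N}\tilde a_{J,2}\,\mu_{\omega_J}(x_{1,k})\Bigr|\\
&\quad + \max_{1\le k\le M_n}\ \sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Bigl|\sum_{J=1}^{N}\tilde a_{J,2}\,\mu_{\omega_J}(x_1) - \sum_{J=1}^{N}\tilde a_{J,2}\,\mu_{\omega_J}(x_{1,k})\Bigr|\\
&= T_1 + T_2,
\end{aligned}
\tag{A.24}
\]

where $M_n \sim n$. Define next

\[
W_1 = \max_{1\le k\le M_n}\Bigl|n^{-1}\sum_{1\le i\le n}\ \sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})\,s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon_i\Bigr|,
\]

\[
W_2 = \max_{1\le k\le M_n}\Bigl|n^{-1}\sum_{1\le i\le n}\ \sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})\,s_{J+N+1,J'+N+1}B_{J',2}(X_{i2})\sigma(\mathbf{X}_i)\varepsilon_i\Bigr|.
\]

Then it is clear that $T_1 \le W_1 + W_2$. Next we will show that $W_1 = O_p(H)$.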
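Let $D_n = n^{\theta_0}$ ($1/(2+\delta) < \theta_0 < 2/5$), where $\delta$ is the same as in assumption (A3). Define

\[
\varepsilon^-_{i,D} = \varepsilon_iI(|\varepsilon_i| \le D_n), \qquad \varepsilon^+_{i,D} = \varepsilon_iI(|\varepsilon_i| > D_n), \qquad \varepsilon^*_{i,D} = \varepsilon^-_{i,D} - E(\varepsilon^-_{i,D}|\mathbf{X}_i),
\]

\[
U_{i,k} = \mu_\omega(x_{1,k})^T\mathbf{S}_{21}\{B_{1,1}(X_{i1}),\dots,B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)\varepsilon^*_{i,D}.
\]

Denote $W_1^D = \max_{1\le k\le M_n}|n^{-1}\sum_{i=1}^{n}U_{i,k}|$ as the truncated centered version of $W_1$. Next we show that $|W_1 - W_1^D| = O_p(H)$. Note that $|W_1 - W_1^D| \le \Lambda_1 + \Lambda_2$, where

\[
\Lambda_1 = \max_{1\le k\le M_n}\Bigl|\frac{1}{n}\sum_{i=1}^{n}\ \sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})\,s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon^-_{i,D}|\mathbf{X}_i)\Bigr|,
\]

\[
\Lambda_2 = \max_{1\le k\le M_n}\Bigl|\frac{1}{n}\sum_{i=1}^{n}\ \sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})\,s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon^+_{i,D}\Bigr|.
\]

Let $\mu_\omega(x_{1,k}) = \{\mu_{\omega_1}(x_{1,k}),\dots,\mu_{\omega_N}(x_{1,k})\}^T$; then

\[
\begin{aligned}
\Lambda_1 &= \max_{1\le k\le M_n}\Bigl|\mu_\omega(x_{1,k})^T\mathbf{S}_{21}\Bigl\{n^{-1}\sum_{i=1}^{n}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon^-_{i,D}|\mathbf{X}_i)\Bigr\}_{J'=1}^{N}\Bigr|\\
&\le C_S\max_{1\le k\le M_n}\Bigl\{\sum_{J=1}^{N}\mu^2_{\omega_J}(x_{1,k})\sum_{J=1}^{N}\Bigl[\frac{1}{n}\sum_{i=1}^{n}B_{J,1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon^-_{i,D}|\mathbf{X}_i)\Bigr]^2\Bigr\}^{1/2}.
\end{aligned}
\]

By assumption (A3), one has $|E(\varepsilon^-_{i,D}|\mathbf{X}_i)| = |E(\varepsilon^+_{i,D}|\mathbf{X}_i)| \le M_\delta D_n^{-(1+\delta)}$, and Lemma A.1 entails that $\sup_{J,\alpha}|n^{-1}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)| = O_p(\log n/\sqrt{n})$. Therefore

\[
\begin{aligned}
\Lambda_1 &\le M_\delta D_n^{-(1+\delta)}\max_{1\le k\le M_n}\Bigl\{\sum_{J=1}^{N}\mu^2_{\omega_J}(x_{1,k})\sum_{J=1}^{N}\Bigl[\frac{1}{n}\sum_{i=1}^{n}B_{J,1}(X_{i1})\sigma(\mathbf{X}_i)\Bigr]^2\Bigr\}^{1/2}\\
&= O_p\{ND_n^{-(1+\delta)}\log^2n/n\} = O_p(H),
\end{aligned}
\]

where the last step follows from the choice of $D_n$. Meanwhile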
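
\[
\sum_{n=1}^{\infty}P(|\varepsilon_n| \ge D_n) \le \sum_{n=1}^{\infty}\frac{E|\varepsilon_n|^{2+\delta}}{D_n^{2+\delta}} = \sum_{n=1}^{\infty}\frac{E\{E(|\varepsilon_n|^{2+\delta}|\mathbf{X}_n)\}}{D_n^{2+\delta}} \le \sum_{n=1}^{\infty}\frac{M_\delta}{D_n^{2+\delta}} < \infty,
\]

since $\delta > 1/2$. By the Borel-Cantelli lemma, one has with probability 1,

\[
\frac{1}{n}\sum_{i=1}^{n}\ \sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})\,s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon^+_{i,D} = 0,
\]

for large $n$. Therefore, one has $|W_1 - W_1^D| \le \Lambda_1 + \Lambda_2 = O_p(H)$. Next we will show that $W_1^D = O_p(H)$. Note that the variance of $U_{i,k}$ is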

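
\[
\mu_\omega(x_{1,k})^T\mathbf{S}_{21}\operatorname{var}\bigl(\{B_{1,1}(X_{i1}),\dots,B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)\varepsilon^*_{i,D}\bigr)\mathbf{S}_{21}\,\mu_\omega(x_{1,k}).
\]

By assumption (A3), $c_\sigma^2\mathbf{V}_{11} \le \operatorname{var}(\{B_{1,1}(X_{i1}),\dots,B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)) \le C_\sigma^2\mathbf{V}_{11}$,

\[
\operatorname{var}(U_{i,k}) \sim \mu_\omega(x_{1,k})^T\mathbf{S}_{21}\mathbf{V}_{11}\mathbf{S}_{21}\,\mu_\omega(x_{1,k})V_{\varepsilon,D} = \mu_\omega(x_{1,k})^T\mathbf{S}_{21}\,\mu_\omega(x_{1,k})V_{\varepsilon,D},
\]

where $V_{\varepsilon,D} = \operatorname{var}\{\varepsilon^*_{i,D}|\mathbf{X}_i\}$. Let $\kappa(x_{1,k}) = \{\mu_\omega(x_{1,k})^T\mu_\omega(x_{1,k})\}^{1/2}$. Then

\[
c_Sc_\sigma^2\{\kappa(x_{1,k})\}^2V_{\varepsilon,D} \le \operatorname{var}(U_{i,k}) \le C_SC_\sigma^2\{\kappa(x_{1,k})\}^2V_{\varepsilon,D}.
\]

Simple calculation leads to

\[
E|U_{i,k}|^r \le \{c_0\kappa(x_{1,k})D_nH^{-1/2}\}^{r-2}r!\,E|U_{i,k}|^2 < +\infty
\]

for $r \ge 3$, so $\{U_{i,k}\}_{i=1}^{n}$ satisfies Cramér's condition with Cramér's constant $c^* = c_0\kappa(x_{1,k})D_nH^{-1/2}$; hence by Bernstein's inequality

\[
P\Bigl\{\Bigl|n^{-1}\sum_{i=1}^{n}U_{i,k}\Bigr| \ge \rho_n\Bigr\} \le a_1\exp\Bigl(-\frac{q\rho_n^2}{25m_2^2 + 5c^*\rho_n}\Bigr) + a_2(3)\,\alpha\Bigl(\Bigl[\frac{n}{q+1}\Bigr]\Bigr)^{6/7},
\]

where $m_2^2 \sim \{\kappa(x_{1,k})\}^2V_{\varepsilon,D}$, $m_3 \le \{c\{\kappa(x_{1,k})\}^3H^{-1/2}D_nV_{\varepsilon,D}\}^{1/3}$,

\[
\rho_n = \rho H, \qquad a_1 = 2\frac{n}{q} + 2\Bigl(1 + \frac{\rho_n^2}{25m_2^2 + 5c^*\rho_n}\Bigr), \qquad a_2(3) = 11n\Bigl(1 + \frac{5m_3^{6/7}}{\rho_n}\Bigr).
\]

Similar arguments as in Lemma A.4 yield that as $n \to \infty$

\[
\frac{q\rho_n^2}{25m_2^2 + 5c^*\rho_n} \sim \frac{q\rho_n}{c^*} = \frac{\rho\,n^{2/5}}{c_0(\log n)^{5/2}D_n} \to +\infty.
\]

Taking $c_0$, $\rho$ large enough, one has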




\[
P\Bigl(\Bigl|\frac{1}{n}\sum_{i=1}^{n}U_{i,k}\Bigr| > \rho H\Bigr) \le c\log n\exp\{-c_2\rho^2\log n\} + Cn^{2-6\lambda_0c_0/7} \le n^{-3},
\]

for $n$ large enough. Hence

\[
\sum_{n=1}^{\infty}P(|W_1^D| \ge \rho H) \le \sum_{n=1}^{\infty}\sum_{k=1}^{M_n}P\Bigl(\Bigl|\frac{1}{n}\sum_{i=1}^{n}U_{i,k}\Bigr| \ge \rho H\Bigr) \le \sum_{n=1}^{\infty}M_nn^{-3} < \infty.
\]

Thus, the Borel-Cantelli lemma entails that $W_1^D = O_p(H)$. Noting the fact that $|W_1 - W_1^D| = O_p(H)$, one has that $W_1 = O_p(H)$. Similarly $W_2 = O_p(H)$. Thus

\[
T_1 \le W_1 + W_2 = O_p(H). \tag{A.25}
\]

Employing the Cauchy-Schwarz inequality and Lipschitz continuity of the kernel $K$, assumption (A5), Lemma A.2(ii) and (A.21) lead to

\[
T_2 \le O_p\Bigl(\frac{N^{1/2}\log n}{n^{1/2}}\Bigr)\frac{\{\sum_{J=1}^{N}EB_{J,2}^2(X_{12})\}^{1/2}}{h^2M_n} = O_p(n^{-1/2}). \tag{A.26}
\]

Combining (A.24), (A.25) and (A.26), one has $\sup_{x_1\in[0,1]}Q_1(x_1) = O_p(H)$. The desired result follows from (A.22) and (A.23). □
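
The truncation device used in this proof splits each error as $\varepsilon_i = \varepsilon^-_{i,D} + \varepsilon^+_{i,D}$ and controls the tail part through a moment bound of the form $E|\varepsilon^+_{i,D}| \le E|\varepsilon_i|^{2+\delta}/D_n^{1+\delta}$. The sketch below illustrates both facts on simulated standard normal errors; it is an unconditional toy version, and the truncation level `D`, the exponent `delta` and the sample size are arbitrary illustrative values rather than the $D_n = n^{\theta_0}$ of the proof.

```python
import numpy as np

rng = np.random.default_rng(4)
eps = rng.standard_normal(100_000)
D, delta = 3.0, 1.0  # illustrative truncation level and moment exponent

eps_minus = eps * (np.abs(eps) <= D)  # truncated part
eps_plus = eps * (np.abs(eps) > D)    # tail part

# Sample analogue of E|eps^+| and of the moment bound E|eps|^(2+delta) / D^(1+delta).
tail_mean = np.mean(np.abs(eps_plus))
moment_bound = np.mean(np.abs(eps) ** (2 + delta)) / D ** (1 + delta)
print(tail_mean, moment_bound)
```

The inequality holds observation by observation, since $|\varepsilon| > D$ implies $(|\varepsilon|/D)^{1+\delta} \ge 1$, so the sample version is satisfied for any draw.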
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